Who Really Wrote the Works of the British Renaissance? thread 2

Esto es una continuación del tema Who Really Wrote the Works of the British Renaissance?.

CharlasTalk about LibraryThing

Únete a LibraryThing para publicar.

1amanda4242
mayo 3, 2022, 2:20 pm

Copied and pasted from AbigailAdams26 introductory post https://www.librarything.com/topic/337240#7670449

Hi All: I want to draw your attention to a new interview we've just posted on the LibraryThing blog, with scholar Anna Faktorovich, in which we discuss her new project, the British Renaissance Re-Attribution and Modernization Series.

ER participants will recognize this name, as the author's main study, as well as some of the newly available texts from the period that she has presented in the series, were offered as giveaways during this past month's ER batch.

Dr. Faktorovich discusses her new computational-linguistic model, and how she has used it in the study of 284 works of the British Renaissance, coming to the conclusion that these works were written by the same six ghostwriters.

Come read her argument, and tell us what you think!

https://blog.librarything.com/2021/12/an-interview-with-scholar-anna-faktorovich....

2amanda4242
Editado: mayo 3, 2022, 2:27 pm

Summary of the previous thread

LT posts an interview with an author who claims she's proven that most of the works of the British Renaissance were written by six ghostwriters. The author reacts poorly when everyone says she's jumped to conclusions based on suspicious data generated by faulty methodology.

And since I'm sure this summary will draw comments:

Vota: Do you believe this is a fair summary of the previous thread?

Recuento actual: Sí 49, No 1

3Petroglyph
Editado: mayo 3, 2022, 2:55 pm

>1 amanda4242:
Thanks! Loading times were getting rather cumbersome. Out with the old, in with the new!

I saw you'd opened a continuation thread, but I pulled a sneaky one and posted one final tedious Petroglyph comment anyway. I'm done explaining basic data management to someone who won't hear it, but it felt fitting to post the last-ditch effort in the old thread.

>2 amanda4242:
I would add something to that description along the lines of "a secretive Workshop of six ghostwriters", or even "most of the works of the British Renaissance, including the plays of Shakespeare, the letters of Queen Elizabeth I and the King James Version". But then again, I'm pedantic that way.

Edit: Poor Crypto-Willobie. Or perhaps now they get to try and get a whole new set of significant digits.

4amanda4242
mayo 3, 2022, 2:56 pm

>3 Petroglyph: I wonder if there's a collective noun for ghostwriters that I could employ. A scribble? A cabal? Oh! How about a haunt of ghostwriters?

5lilithcat
mayo 3, 2022, 3:01 pm

>4 amanda4242:

How about a séance of ghostwriters?

6paradoxosalpha
mayo 3, 2022, 3:08 pm

I believe the prescribed term is "workshop."

But I'd vote for "cramp."

7Petroglyph
mayo 3, 2022, 4:11 pm

A Communion of ghostwriters?

8abbottthomas
mayo 3, 2022, 6:08 pm

I do rather regret this continuation. There was something uniquely appalling about the original thread that was somehow enhanced by the increasingly long time it took to load.

92wonderY
mayo 3, 2022, 6:26 pm

I was hoping it would die the death it deserved.

10faktorovich
mayo 3, 2022, 10:09 pm

I guess you have conspired within your Cabal to start this new thread with a volley of insults (that are failing to appear humorous) tossed in my direction. Yes, the term I apply to the Renaissance's six underlying authors is the Ghostwriting Workshop because this is the term that was used during the Renaissance to describe collaborative writing clubs (as I explain in BRRAM's Volumes 1-2).

And this statement is incorrect: "The author reacts poorly when everyone says she's jumped to conclusions based on suspicious data generated by faulty methodology." A more accurate summary would be: "The author presents accurate and precise data and an entirely new attribution method, both of which prove all previous attribution studies of the Renaissance to be fraudulent or faulty in their data and methods. The authors of some of these previous faulty attribution studies (writing under pseudonyms, and thus without acknowledging their bias) react poorly towards being discovered to have erred, and so they attack the author with insults and nonsensical and cyclical anti-arguments intended to disqualify the findings without actually finding any real fault with them."

11Petroglyph
mayo 3, 2022, 10:35 pm

I asked Faktorovich to show me "Verstegan's authorial signature? And Percy's? And Jonson's? All six of them, in fact. " As well as the authorial signatures as they emerged from your Lunch Break Experiment (tm).

Faktorovich wrote: "I narrate what the six Renaissance signatures signify across the BRRAM series {...} lengthy chapters of explanation. Can you clarify what you are trying to ask here, if you are not asking me where to find my data?"

The data I am asking for is not something you have provided in one convenient spot; rather it is spread out across several volumes.

You test texts for various measures: full stops, question marks, Auxiliary verbs, % passive voice, syllables per word, lexical density, etc, and the unquestionably apophenic nonsense that is your "patterns" of the 6 most frequent letters and words.

You claim to be able to recognize an author's signature as a result of 27 of your tests (with more added apophenia in the "most frequent three-word phrases".

So. There has to be a certain range of full stops that you think is typical for Percy and, and one that you think is typical for Ben Jonson's. There has to be a unique combination of outcomes for these 27 tests that allows you to say "Sylvester wrote this".

I want to see the "authorial signature" for Percy and Jonson and Verstegan and the others, all in one place.

I want to see a table with a header row and with six data rows, one for each of your six ghostwriters. The columns are the 27 tests. In each data row, I want you to put the range of scores that are "typical" for each author for that test. So in the Percy row, for, say Auxiliary Verbs, I want you to put the range: put something like "3.2 - 7.5" in that cell -- the typical range in which Percy's use of auxiliary verbs falls. The typical Percy number for % Passive voice, the signature Percy range for syllables per word.

I want you to do that for all of the members of your secret workshop. And, if you can, also for Austen and the Brontës and Marie Corelli. But mainly I want the workshop signatures.

There are scattered remarks in your book about how the a-pattern is favoured by author X, and some other pattern is favoured by author Y, and so on. But I want to see all of that in one table.

You should be able to test a renaissance text that is not in your corpus, and, comparing the results for that text with the authorial signatures in that table, match that text to its putative author(s) with a certain probability.

I would think that such a table would be the crowning achievement of your method.

12prosfilaes
mayo 3, 2022, 10:43 pm

>10 faktorovich: The authors of some of these previous faulty attribution studies (writing under pseudonyms, and thus without acknowledging their bias) react poorly towards being discovered to have erred,

I don't really know what to say. It's always a conspiracy, instead of a bunch of random people disagreeing with you because you didn't convince them.

13Keeline
mayo 3, 2022, 10:54 pm

>4 amanda4242: The Stratemeyer Syndicate employed a lot of ghostwriters so you could use

a syndicate of ghostwriters

James

14Petroglyph
Editado: mayo 4, 2022, 12:57 am

Replying to this Faktorovich post #1151

Great Cthulhu in the deep below, I was right. She uses the red font colour way of counting because counting her data properly (i.e. the yellow highlighter way) would return absolute garbage as result.

"Many of my tests return unusably small ranges and test for features that are used too sparsely to meaningfully separate authors. So I need to change the way I'm counting so I can keep using the tests."

the exclamations test is frequently near-zero or zero for a significant portion of texts. But there are some that have high exclamation values. If you are measuring 18% of the highest value in the range, you might judge not only all 0's, but also most of the other exclamation values to be similar to each other or a 1 (if a text falls in this group), whereas only a few outliers near your highest-value would be dissimilar or 0.

and:

Then, you have an epiphany that I have "some very narrow ranges" in some of the 27 tests, such as Adjectives, where it is a range of 2 points between the minimum and maximum. If I measured Adjectives by calculating 18% from any value at such a close curve, it would increase the number of matches for each test, and it would dramatically increase the percentage of tests on which texts would match. Just as the exclamations are clustered around 0, Adjectives also have a clustering point around which most of the texts are within 18% of each other.

If a test returns results that are concentrated within an unusably narrow band or that are too erratic to be systematically meaningful, then a proper scholar would conclude that testing for that particular feature is, in fact, not meaningful, and would remove it from consideration in favour of a better test.

You, however, continue to use the meaningless test and you devise a ridiculous way of counting, specifically so you can keep the meaningless test around and base your revolutionary new method on it and several others like it.

I'll say that again: Your results are based (in part) on tests that return results too erratic or too narrow to be useful in separating out authors. And you've twisted the way you count these results in order to keep including the meaningless tests instead of discarding them for the useless results they are.

Garbage in, garbage out.

So by following your method, I would be engaging a trick that would make it seem as if a lot more texts are similar to each other. But the problem with this approach is that while the total "matches" would go up, the number of false-positives would also increase. My goal is not to see the highest possible percentile match, but rather the most accurate possible results, and this is achieved by just counting a percentage of texts and not the percentage between their numeric outputs.

It's not a "trick". It's revealing those tests as unfit for the purpose you want them to fulfill.

Your method can only work in a corpus with a single test-type, such as word-frequency, as in this method you might be able to choose a specific percentage that would make sense. Because I am combining the outputs of 27 different tests, I cannot use any standard percentage cut-off for all tests."

What? This is nonsense. You use the ~9% above + ~9% below as based on corpus size rule consistently (the red colour rule); you could just as consistently apply the yellow highlighting rule. And if you chose sensible tests, whose results would not return ranges with insignificant differences between min and max, you wouldn't have to resort to creative data counting.

15prosfilaes
mayo 3, 2022, 11:32 pm

>1151 faktorovich: (from the previous thread):
Consider this experiment, you are trying to figure out which out of a group of children are most likely to be the sons of which fathers. You don't have access to DNA testing, so you are sorting them by obvious traits like hair color and eye color etc.

I.e. you're ignoring everything we know about the inheritance of traits like hair color and eye color in humans.

You create a table for each of these tests. Let's look at the test for eye color, you have chosen to register specific colors instead of just sorting the kids into blue, brown etc. categories. Perhaps you happen to have a pool of children that mostly have variants of the brown eye color. If you use the 18% measure as a cut-off while placing children on the rainbow spectrum, you might have up to 95% kids falling closer to the brown color than to the other colors. You have thus made this test statistically useless because it is not registering the variations between brown that your careful photographing of eye color and separation of it on the full spectrum was designed to provide. If you separate the children on this spectrum and then choose the 18% of children that are closest in eye color to a tested father; then, you will receive the specific answer to the question which of the children are similar in eye-color to each of the potential fathers. Do you have any questions about this?

So you're presuming you have a complete list of fathers upfront. That seems problematic in reality, and in analogy.

Secondly, you seem to be putting getting an answer above getting a correct answer. If you match the kid with blue-grey eyes with the father with blue-green eyes, then any of the brown eyed kids should match with any of the brown eyed fathers. Even without looking things up, I'd expect fine details of eye color to be based off of environmental factors, making any fine-grained comparisons meaningless. Note that in your literary comparisons, that would sweep up outlying works into a small set of ghostwriters, instead of recognizing outlying works as possibly being the only work by an author.

16Keeline
mayo 3, 2022, 11:55 pm

For anyone who is confused by memes, this is a resource:

https://knowyourmeme.com/

James

17Petroglyph
mayo 4, 2022, 2:35 am

Here is another report on the silliness and nastiness that is Faktorovich's mind-bogglingly wrong-headed book Re-attribution.

While I think that the paragraph I want to highlight in this post is a particularly clear example of the hopelessly convoluted mess that is Faktorovich's thinking, I do realise that this may come across as yet more overexplained tedium that's not worth your time. Which is fair enough. I'll cover the annoying bits with spoiler tags.

In my defense, I can only say that I wanted to use this paragraph in my review, but then the fact-checking got out of hand, and now I have a whole bunch of notes about just one single paragraph of distilled stupidity, nastiness and just sheer wrongness, and I didn't want them to go to waste. So here they are.

1. Necessary background

If you want to understand the absolute dog's breakfast that Faktorovich makes of things, I need to explain a few things first, so here's a list. I've tried to put the critical information first, and the expanded background behind spoiler tags.

Some vocabulary: hapax legomena: words that occur only once in a corpus
the proportion of hapax legomena in a newspaper article can be as high as 75%; the proportion of hapax legomena in one corpus from the 1970s was 40%
Kennedy (1998, p. 100)says the following:
"In a short newspaper item of about 200 words, as many as 150 words may be hapax legomena. In large representative corpora, the proportion of hapaxes is normally much smaller. For example, in the American Heritage Intermediate Corpus (Carroll, Davies & Richman, 1971), which is an unlemmatized corpus of 5,088,721 words, 35,079 of the 88,741 different types are hapaxes (39.5%). Nevertheless, the fact that almost 40% of the words in a corpus of over five million words occur only once shows that a corpus of even that size is not a sound basis for lexicographical studies of low frequency words."
In one particular 10-million word corpus, the top 2000 most frequent words accounted for 80% of all the words in the corpus
O'Keeffe et al. (2007, p. 32) talk about "a mixed corpus of ten million words of English (made up of five million words of spoken data, from CANCODE, and five million words of written data taken from the Cambridge International Corpus, CIC)". When talking about this graph (image link), O'Keeffe et al. say this:
If we examine the frequency of words in a large corpus of English, a picture emerges where the first 2,000 or so word-forms do most of the work, accounting for more than 80% of all of the words in spoken and written texts. As we progress down the frequency list, each successive band of 2,000 words covers a progressively smaller proportion of all the words in the texts in the corpus, with many words occurring only a small number of times or, indeed, only once. (2007, p. 32)
Jackson, in his 2003 book, tries to assess how the play Pericles fits within the Shakespearian body of work. At some point he finds it useful to looks at words occurring with a frequency between 2 and 10.
Jacson sees a separation in vocabulary between Pericles Acts 1-2 and Pericles Acts 3-5. I do not have access to a digital version of this book, so I'm forced to go by what Faktorovich says about it. Why Jackson looks at vocab that has a frequency of between 2 and 10 -- that is to say, excluding the hapax legomena (they occur only once and so are too unreliable to base much on), and excluding those words that have very, very high frequencies. My guess is that Jackson aimed to look at mainly content words, and excluded the really rare ones and the high-frequency function words. But like I said, I don't have access to the Jackson book, so don't quote me on that one. In general, though, taking multiple looks at your data through multiple lenses, and isolating certain variables is a good idea.

2. The headache

Alright. with these preliminaries out of the way, I present to you today's headache-inducing Faktorovich quote (here is a screenshot of this quote in a larger context):

Thus, by narrowing the frequency to between 2 and 10 occurrences, Jackson excludes both the most frequent words (which are likely to occur more than 10 times per play: 80%), and the hapax (single-occurrence: 40-75%) words; thus, it is possible that all words (or 80+75 = 155%) in a given sample can be excluded by choosing these parameters. These statistical absurdities can be avoided by comparing only the most frequently occurring letters, words and phrases, as these are representative of the common elements in authorial style. (Re-attribution, p, 475

The thrust of Faktorovich's drivel is that she takes issue with Jackson's (2003) analysis of the play Pericles. Her disagreement takes the form of a particularly juicy befuddlement as she takes a few unconnected corpus statistics she's taken from introductory textbooks and blends them together in an unholy, anti-mathematical mess that culminates in the nasty, paranoid allegation that Jackson has chosen his frequency range of between 2 and 10 words just so he can exclude all the words in a text and run his analyses on hand-picked words. Aside from conflating "frequency higher than 10" with "covering 80% of the words in a corpus", and that thing where she assumes that that 80% + 75% = 155% happens in other people's minds, too, there's just so much unthinking blather here that it requires a lot of context just to disentangle.

There's much more nonsense in the screenshot of that passage (e.g. assuming that what's "confusion-generating" for her is also confusing for others; 2% of 10,000,000 is 2000??; conflating her renaissance corpus with a properly balanced representative corpus). But that's for the diehard drama aficionados.

3. Why bring up so much background for a paragaph of nonsense?

Why all this? Well, the fact-checking kinda got away from me.

But there's a larger point to be made here, too.

The paragraph I quoted is part of Faktorovich's uncomprehending discussion of Jackson's arguments, which are a large part of an entire chapter on Pericles (pp. 462-180). Look at the last sentence of that paragraph I quoted above:

These statistical absurdities can be avoided by comparing only the most frequently occurring letters, words and phrases, as these are representative of the common elements in authorial style

In other words: "other researchers nonsense; my method true."

This is one of the most constant, repetitive drumbeats to Faktorovich's train of thought. The rhetorical point of that paragraph, like the rhetorical point of so much else of her drivel, is to present her own method as superior by comparison. All this ignorant and accusatory meanness serves the goal of making her own method look not just better, but the only correct one. (Queen Gertrude might say something about the lady protesting too much.)

And there's just a constant barrage of paranoid shit-flinging. In the two and a half pages leading up to that paragraph I highlighted, Faktorovich is very diligent in accusing Jackson of, what else, BiAs and dAtA mAnIpUlATiON. No fewer than seven explicit accusations in the span of two and a half pages:

"It is likely that Jackson selected these links with an undisclosed bias if{, in Faktorovich analysis, Ben} Jonson’s light touch came across as the dominant signature" (p. 473)
"Jackson does not provide the 377 words that led to this erroneous conclusion; this non-disclosure prevents an audit of the cause of this strangely high match between these texts" (p. 473)
"all words in these texts and not only handpicked ones have to be measured to determine if these echoes indicate a greater contribution from one of these ghostwriters." (p. 474)
"Jackson’s findings can be double-checked more fully when he discloses a few revealing elements of his chronological analysis" (p. 474)
"Perhaps in an attempt to complicate audits, Jackson did not include these dates, so these had to be derived by checking first-performance data" (p. 474)
"Jackson then employs a summary table (again without the raw data) to demonstrate that an unknown combination of these early plays is significantly more similar to Pericles’ Acts 3-5 than to 1-2. Since my tests indicate that Jonson had a slightly greater contribution to Acts 3-5 of Pericles than to 1-2, this earliest category should be a closer match to Acts 1-2." (p. 474)
"Jackson’s data indicates a complete reversal in linguistic attribution that defies the rules of statistical-logic" (p. 474)
And the "statistical absurdities" smear discussed above (p. 475).

It continues like that in the subsequent pages, too. It never ends.

Faktorovich, uncomprehendingly lashing out at actual scholars, throws bad-faith accusations and paranoid strawmen at this study, casting Jackson as one of the mean, nasty academicses performing statistical mumbojumbo to manipulate their results and hide the truth. The Truth, which has only been uncovered by her own idiosyncratic research that bears all the hallmarks of a conspiracy theory.

It's bluster. Mean-spirited bluster. Strawmanning with dishonest intent.

4. I want this to conclude!

Ugh. The whole book is like that. It's a never-ending assault on quality work and good taste and there's no end to it. Like, look at the length of these paragraphs!

By Grabthar's Hammer, you guys. The stupidity is just so concentrated and so relentless! This entire book is a dense, mis-argued, mean-spirited muddle of anger and confusion. A dogged insistence on an extremely literalist and context-less reading of some half-understood corpus linguistics, and she'll use the resulting mess to discredit someone else's words and make her own look better by comparison.

Writing about this book is a deeply unpleasant experience.

References

Faktorovich, Anna. 2021. The Re-Attribution of the British Renaissance Corpus. 1st ed. British Renaissance Re-Attribution and Modernization Series 1–2. Quanah: Anaphora Literary Press.

Jackson, MacDonald P. 2003. Defining Shakespeare: Pericles as Test Case. Oxford: Oxford University Press.

Kennedy, Graeme. 2014. An Introduction to Corpus Linguistics. New York: Routledge.

O’Keeffe, Anne & Michael McCarthy, Ronald Carter. 2007. From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.

18anglemark
Editado: mayo 4, 2022, 11:24 am

>17 Petroglyph: I have electronic access to Jackson's book – did Faktorovich provide any page numbers for the claims she makes about Jackson's methodology, at the bottom of page 474? I think that when she quotes Jackson as distinguishing between words "occurring 2-6 times" and "occurring 2-10 times", she refers to Jackson's discussion of Eliot Slater's research, and Jackson's own method which builds on Slater:

"... Inevitably, the section that follows contains a forbidding mass of figures and cites statistical tests of significance. But the basis for these is the simple rule that works written by the same author at about the same time are apt to have more of their low-frequency words in common than works whose dates of composition are separated by many years.

Eliot Slater compiled card-indexes recording all instances of words that occur in at least two Shakespeare plays but not more than ten times altogether. In a series of articles and a University of London doctoral dissertation, published posthumously as a book, he demonstrated that the vocabulary of any one Shakespeare play tends to be linked most closely with the vocabulary of other plays that he wrote within the same period. Likeness or unlikeness of subject matter and genre may increase or decrease the degree of linkage, but the chronological factor is dominant, or, at the very least, considerable. Slater’s method was to compare, by the use of chi-square tests, the actual number of word links between plays with the number of word links to be expected were the distribution purely random. He calculated ‘expected’ figures on the basis of the relative size of each play’s total vocabulary, but expectations may more appropriately be derived from the proportion of the total number of link words (those occurring 2–10 times in the dramatic canon) that are present in each play. Expected figures worked out in this way differ only slightly from those produced by Slater’s more approximate procedure: overall patterns remain unaffected, but a few marginally significant associations between plays disappear when the more precise measure is used. The analyses that follow, though accepting Slater’s figures for actual links, employ my own more precisely computed expected figures." (Jackson 2003:40-41, referring to Slater, Eliot. (1988). The Problem of ‘The Reign of King Edward III’: A Statistical Approach Cambridge: CUP)

And a bit further down:
"If we confine analysis to the rarer vocabulary—words occurring 2–6 times in the canon—the predominance in Pericles, 3–5, of links with late plays persists. The most significant associations are now with Pericles, 1–2 (8 links, 3.58 expected, chi-square 5.46), The Tempest (16: 9.49, chi-square 4.47), and Timon of Athens (16: 9.7, chi-square 4.09). For Pericles, 1–2, on the other hand, the pattern changes. The significant linkage with Antony and Cleopatra and Coriolanus disappears, The Merchant of Venice replaces its near-contemporary 1 Henry IV as a play with marginally significant associations with Pericles, 1–2 (12:6.6, chi-square 4.42), and the truly significant relationships are with Titus Andronicus (15:7.97, chi-square 6.2) and 1 Henry VI (19: 9.64, chi-square 9.09)." (Jackson 2003:43)

Edited to add: Jackson doesn't by any means uncritically accept Slater's method – see also
Jackson, MacDonald P. (2015). Vocabulary links between Shakespeare's plays as a guide to chronology: a reworking of Eliot Slater's tables. Shakespeare, 11(4), 446-458. DOI: 10.1080/17450918.2014.985604.

-Linnéa

19faktorovich
mayo 4, 2022, 10:55 am

>11 Petroglyph: As I explained in a previous post, the "Lyly"-assigned and Percy-ghostwritten "Sapho and Phao" matches 17 Percy-group plays, and no texts from any of the other linguistic-groups among the other 267 texts. This is equivalent to rolling "1" on the dice every one of the 17 times you bet on "1" at the casino, and in the same game rolling "3" every one of the other 267 times you bet on "3". If you had those 100% odds of winning at a casino, would you call them "random"?

The "most frequent three-word phrases" test is not one of the quantitative 27-tests, but rather is the 28th test that is not part of the statistical attribution conclusion; it is used to check if it matches the overall conclusion, and to research further what the attribution signifies.

I began with a corpus of around 100 texts, and gradually expanded it to 284 texts. With each text I have added, the "signature" for the author of the added text mutated slightly to accommodate its dimensions. For example, the artistic "style" of Van Gogh is defined by analyzing his paintings across his career, and reaching a conclusion about the stages and variants he was capable of. If we had only 2 surviving Van Gogh paintings, the art world's understanding of his style would have been very different than it is for the full range that we actually have. Even if changes are small from each added text, it would statistically inaccurate to give you the specific ranges for the current texts in the corpus, as they might change by a bit from an expansion of the corpus. Additionally, as I explained, there is a lot of collaboration across this corpus. Jonson and Percy, Verstegan and Harvey, Sylvester and Harvey, and and some of these with Byrd frequently co-wrote texts. The co-written texts are grouped with the dominant ghostwriter, but co-written texts have a percentage of their signature that fits with the co-writer's linguistic range, and not with the dominant hand's. Thus, probably only completely or mostly single-authored texts should be considered when establishing a linguistic range for a specific author. Even if such a range is derived, if you choose a new text from the Renaissance and compare it to this pure single-author range, if the text in question is co-authored, it would fall between two pure ranges. This is why it would be a mistake to start with the range and check works against it. You really have to go through the full 27-tests process to reach a precise attribution conclusion. If you want to know the current ranges, or to experiment with them, the data is available on my GitHub, so you can go ahead and check for yourself what the ranges are. The data is already in one place. No computational-linguistic study I have looked at has claimed to know or to list the precise signature-range for any given byline.

Why are you requesting me to create a table for you with specific dimensions etc. that you already have in mind. I posted the data on GitHub so that anybody can manipulate or experiment with the data. So just manipulate it and come up with the signature ranges that you want to see. Yes, I comment on the a-pattern and several other more complex patterns across the series. The data tables on GitHub include all of this data already. You can re-order the table for the 27-tests to see which character/ word patterns appear in which of the texts and groups. My summary of these patterns in a single table would be simplistic and would lack all of the fine points in the complete set of data. The most-common pattern for a given signature proves the existence of the signature style, but the second and third most-common patterns also communicate points in the larger narrative about the components of an author's style.

Nothing I have said so far in the overwhelming amount of evidence I have provided has convinced you, but you are saying that a summary table with a bunch of signature-defining statements would convince you? Why do you want to see such a table?

20faktorovich
mayo 4, 2022, 10:58 am

>13 Keeline: Again, the ghostwriters in the Renaissance called themselves a Workshop (and "ghost-writers"), so that's why I called them the Ghostwriting Workshop.

21faktorovich
mayo 4, 2022, 11:05 am

>14 Petroglyph: The data might appear to be in "an unusably narrow band" only if the 18% in its concentrated data-peak is compared with the data in the outlying ends of the range. But when the numbers are compared as individual texts, there is an identifiable pattern that spots the different authorial signatures between the underlying authors.

The tests I choose for a given corpus are the most suitable ones for it that register the stylistic preferences of the underlying author. The consistent accuracy of my results prove that I have chosen the correct tests for the Renaissance.

22faktorovich
mayo 4, 2022, 11:17 am

>15 prosfilaes: If it helps, imagine that there has been a mix-up at the fertility clinic and the samples from a known list of fathers have been mixed up, so you know who the fathers are, but not whom among the children they fathered. Alternatively, you can imagine there are only 2,000 sperm donors (equivalent to the no more than 2,000 or so British authorial bylines from this century), and you have to check the 284 children against each of these donors characteristics tables.

If you are concerned eye-color alone is insufficient to achieve a match, this is exactly what I believe as well. This is why I combine 27 different tests, whereas Stylo/ the accepted method in computational-linguistics is to only measure word-frequency, or only 1 test (equivalent to only measuring eye-color). And when you consider that eye-color might not be exactly matched between a father and a child, you are considering the impact of the mother's eye-color-characteristics; and this is indeed another element my approach helps to reveal, as it is sensitive to co-authorship and spots texts with two parents as opposed to clones with a single parent, etc. Environmental factors are things like vocabulary-range for an author, and these define a linguistic style (as we are discussing linguistics and not eye-color). And as long as I considered the full range of 2,000 potential donors (as I did), there is no exclusion of "outlying works" in favor of any popular byline. If I had not reviewed a broad corpus of potential authors, I would not have even considered Percy, Harvey, Verstegan, Byrd or Sylvester as none of these authors have ever been proposed before as alternative "Shakespeares" or as the "real" authors behind any other anonymous or pseudonymous Renaissance texts. Only by reviewing thousands of potential bylines did I eventually come across these specific donors that fit the unique characteristics of the children (and the timing of the donations).

23anglemark
Editado: mayo 4, 2022, 11:46 am

>22 faktorovich: "This is why I combine 27 different tests, whereas Stylo/ the accepted method in computational-linguistics is to only measure word-frequency, or only 1 test"

Are you simply making things up as you go? Stylo is (as has been explained to you) not a method, it is a tool. 27 tests is (as has been explained to you) a very small number indeed, and back in post 196 in the previous thread you got a list of (at least) several hundred potential tests possible to carry out using Stylo.

-Linnéa

24faktorovich
mayo 4, 2022, 11:51 am

>17 Petroglyph: The quote from my "Re-attribution", p. 475 and the surrounding information in that chapter of BRRAM explained exactly what your preceding introductory comments pointed out, but you did not give me credit for explaining it. I did not make any mistakes of believing 2% of 10 million is 2,000; if I had you would have cited a specific quote where I said anything of this sort. My statements in this section are very accurate and logical; only your nonsensical digressive discussion about it is illogical and mistaken.

Your paraphrasing of my "very diligent... accusing Jackson of bias and data manipulation" appears to be accurate. Yes, I accused him of both of these things, and the quotes explain my objections with evidence that this is indeed the case. You do not provide any evidence to contradict my findings that Jackson and other computational-linguists I review across BRRAM are not biased/ in error. Just because you yell back at me a quote of my rational conclusions and argue that the yelling makes them sound insane, it does not mean you have convinced anybody of your emotion-based explosive reaction, but rather you have betrayed you have no logical argument to make to support your generally antagonistic position.

By the way, you broke copyright law when you posted images of 5 pages from Volumes 1-2. You are allowed to transcribe quotes (up to a paragraph on a topic) from books under "fair use", but you are not allowed to publish 5 pages online for the public to access. And the images of my design/ formatting are not covered in "fair use", which is only designed for researchers who need to communicate an idea. For example, if I had published a book of my paintings, and you had scanned and published these paintings, you would be in clear violation of copyrights law. One clue you are failing to employ a text in "fair use" is when you do not actually read any of the pages you are illegally publishing. This is the first time you have hinted you read any part of BRRAM other than the summary of the basic method, and you are only "reading" a paragraph, while illegally publishing 5 pages. And you are restating content from that paragraph as if it is your own ideas, while accusing me of errors that are not actually there or that you cannot quote. The only errors you are citing is that I find fault with Jackson's study, and you think that I do not have a right to criticize errors in somebody else's linguistic research.

25faktorovich
mayo 4, 2022, 11:54 am

>18 anglemark: Now you are asking for second-hand quotes from my book, instead of just asking me for a review copy to check my citations for yourself?

26faktorovich
mayo 4, 2022, 11:58 am

>23 anglemark: Jackson and almost all of the computational-linguistic studies that have been discussed in this thread only use the word-frequency test when utilizing Stylo. I included other software that can be used to test word-frequency in my statement after the /

And no, the list of "several hundred potential tests" did not actually include many applicable tests that are not already included in my 27-tests, and there were certainly not over 100 different tests in the tossed-together list that was provided, and the list included various errors that I explained in response to that post.

27Keeline
mayo 4, 2022, 1:25 pm

>20 faktorovich:

Again, the ghostwriters in the Renaissance called themselves a Workshop (and "ghost-writers"), so that's why I called them the Ghostwriting Workshop.

Kindly show me an image of a vintage document from the 1500s (with full citation) that uses any variant of "ghostwriter" or "ghost-writer" or "ghost writer" or any variation of this spelling. The information available to me is that the phrase is not more than 100 years old. The concept is older.

"Workshop" is said to go back to 1556 though not in a literary sense.

When I call the Stratemeyer Syndicate (1905-1984) a "book packager," I know full well that the term was not in use when it was founded. However, modern groups who use similar methods are often called "book packagers" and to use the phrase today helps to communicate the methods used when a reader is familiar with that phrase. Of course, when I see "book packager" associated with the Stratemeyer Syndicate, it is a clear indication that the writer has seen my work directly or saw it indirectly since the small field of researchers had not applied the term before I did back in the 1990s.

James

28Keeline
mayo 4, 2022, 1:37 pm

>24 faktorovich:

You have an incorrect perception of what is permitted under "Fair Use" provisions under the U.S. Copyright law. Here is a summary which may be enlightening.

https://www.nolo.com/legal-encyclopedia/fair-use-rule-copyright-material-30100.h...

Five pages from a volume with hundreds that are reproduced in a small format is an illustration of the page layout. From a practical sense, court cases have found that usage of up to 10% of a work can be permitted in certain circumstances. There are lots of details and case details so we won't get in to those weeds.

But if you are very concerned with this, get the opinion of a qualified intellectual property attorney to see if you have a claim that is worth pursuing.

James

29Petroglyph
mayo 4, 2022, 2:11 pm

>18 anglemark:

Thanks, Linnéa, for going to all that effort! Ett stort tack!

The footnoted citations in that chapter are to p. 42, 40-47, 48, 55. So the bits you quoted indeed concern the relevant parts of that book.

I knew that looking at words with frequencies 2-6 and 2-10 would make sense in context! Faktorovich just calls it "confusion-generating" because the same words would be in both tests.

30Petroglyph
mayo 4, 2022, 2:25 pm

>19 faktorovich:

This is equivalent to rolling "1" on the dice every one of the 17 times you bet on "1" at the casino,

No. It's not like that at all.

The "most frequent three-word phrases" test is not one of the quantitative 27-tests, but rather is the 28th test

I know that. I said: "a result of 27 of your tests (with more added apophenia in the "most frequent three-word phrases")" But I can see how a careless reader who is eager to discover faults can miss that.

I began with a corpus of around 100 texts, and gradually expanded {...}

Not what I asked for. Excuses not to do it. I'm not asking for the definitive, ultimate, perfect result. I'm asking for the summary of all your testing in the form of a table summarizing the styles as you have used them.

If you want to know the current ranges, or to experiment with them, the data is available on my GitHub, so you can go ahead and check for yourself what the ranges are.

More excuses. Shifting the work onto the reviewer. Even here you're terrible at play-acting as a scholar.

You can re-order the table for the 27-tests to see which character/ word patterns appear in which of the texts and groups. and So just manipulate it and come up with the signature ranges that you want to see

Not my job, but yours. Just an excuse not to do it yourself. And I don't want to make any mistakes. When you do it, I know the job will be up to your standards and you'll have no reason to complain about others mucking things up.

My summary of these patterns in a single table would be simplistic and would lack all of the fine points in the complete set of data

Yes, that's why it's a summary. Another excuse.

you are saying that a summary table with a bunch of signature-defining statements would convince you?

No, that's you reading into my question.

Even if such a range is derived, if you choose a new text from the Renaissance and compare it to this pure single-author range, if the text in question is co-authored, it would fall between two pure ranges.

Yes, I know that. But also: if any of the six wrote (most of) that 'untested' text, it would fit. Duh. Is this another excuse?

Why are you requesting me to create a table for you with specific dimensions etc. that you already have in mind. and Why do you want to see such a table?

Why are you so suspicious? This summary table would be the crowning glory of your work! "Here are the six styles that are dominant in this corpus! Anyone can test a text and see who it was by!"

If you're worried about the collaborative thing, it would be easy to add one row for each possible collaboration.

This way, anyone could test texts from the Renaissance, and instead of doing the whole "replacing with ones and zeroes" thing just go straight to the summary table and check against an author.

It would be a quick verification of the whole thing. I can't, honestly, see a reason not to provide such a table.

31Petroglyph
Editado: mayo 4, 2022, 2:48 pm

>19 faktorovich:

I wanted to highlight this particular bit on its own:

I began with a corpus of around 100 texts, and gradually expanded it to 284 texts. With each text I have added, the "signature" for the author of the added text mutated slightly to accommodate its dimensions.

Put differently: Faktorovich adds a text to her corpus that is a little like, say, Percy or Sylvester or Verstegan, but that does not carry that "by-line". She then either expands the style of the author(s) closest to this text to encompass it. This "gradual" expansion happened ~180 times since the initial ~100-text corpus.

I just thought that methodological statement was worth commenting on.

32Petroglyph
mayo 4, 2022, 2:46 pm

>21 faktorovich:

Yeah, you've already told us you resorted to creative counting when faced with tests that yield results that are too erratic or cluster in too narrow a band to be useful.

It's one of the many reasons why everyone who is not you is justified in rejecting your dreck.

In your lunch break experiment (tm) (partial screenshot here), I count 12 tests where the difference is less than 6; there's 4 more where the difference is between 8-12.

More than half of your tests you have to force into your "analysis" by counting the red-colour way because the things they supposedly test for are irrelevant.

Thank you, lorax (if you're still here), for introducing me to the term fractally wrong. It's *chef's kiss* perfect!

33Petroglyph
mayo 4, 2022, 5:48 pm

Question for Faktorovich:

When you are detecting your apophenic "patterns" in the 6 most frequent words, do you normalize those data at all? Or do you work with the raw absolute frequencies?

A simple yes or no will suffice. Thanks!

34Petroglyph
mayo 4, 2022, 6:52 pm

Question for Faktorovich.

Several times, when I mention that, in terms of word frequencies, you only look at the six most frequent words, you have disagreed and said something along these lines:

No, I look at every single word within every word in every one of the texts I evaluate, and from all of these derive the unique 6 most-common words in each text. (Page 1, msg #1103)

Quick clarification question. What exactly do you mean by that?

Do you mean something like this: In order for the software to determine that the, of, in, is, ... are the six most frequent words, it had to count the frequencies of all the words in the text (or the corpus). Therefore, the analysis of the top six words implies the analysis of all others.

Is that an accurate paraphrase of your position? Am I understanding this correctly?

35prosfilaes
mayo 4, 2022, 8:09 pm

>22 faktorovich: If you are concerned eye-color alone is insufficient to achieve a match, this is exactly what I believe as well.

That is not what I was concerned about.

If you separate the children on this spectrum and then choose the 18% of children that are closest in eye color to a tested father; then, you will receive the specific answer to the question which of the children are similar in eye-color to each of the potential fathers.

If you have one blue eyed father, and 11 brown eyed fathers, and two green eyed children, and 10 brown eyed children, your system will happily tell you the two green eyed children are similar in eye color to the blue eyed father, and pick out two children to be matched with each brown eyed father, instead of saying there are zero close matches for the blue eyed father and 10 for the brown eyed fathers. You're hiding information and forcing it to give you a result, no matter what.

36faktorovich
mayo 4, 2022, 8:47 pm

>27 Keeline: You missed my previous post where I explain the earliest uses of this term:

---My definition is ground in lines such as these in “The Grounds of Divinity, plainly discovering the Mysteries of Christian Religion” (1633: https://www.google.com/books/edition/The_Grounds_of_Divinitie_plainely_discov/DV...: “The holy scriptures are all those Books of the Old and New Testament, by the direction and inspiration of the Holy Ghost, written, or approved by the Prophets and Apostles.” Or in the commentary on “Psalm XCIII” in “Holy Bible” (1635: https://www.google.com/books/edition/Holy_Bible_Faithfvlly_Translated_Into_En/WI...: “For in human books the writer and author is all one; but in divine, the Holy Ghost is the proper author, and a man is the writer.” The non-theological application of a similar meaning is found in a byline, “Written by Thomas Nash his Ghost, with Pap with a Hatchet” (1642 edition: https://quod.lib.umich.edu/e/eebo/A94724.0001.001?view=toc)--- (I go on to explain it further.)

Let me know if you need further clarification.

37faktorovich
mayo 4, 2022, 8:55 pm

>28 Keeline: If you took 10% of a book and published it (public posting is a publication) you would be committing an extreme copyright infringement; the 10% rule is likely to apply to a poem or another short work where 10% is not an extremely large quantity of information. And all of the information you are re-posting has to be discussed in the new piece of academic writing you are creating; it cannot be simply re-posted to show that a writer has a very long paragraph. Here is a list of songs that have been under copyright dispute/ settlement: https://en.wikipedia.org/wiki/List_of_songs_subject_to_plagiarism_disputes - I recall watching a documentary video from a lawyer that explained a recent case was over a few repeating notes that were similar between songs.

I am the "qualified intellectual property attorney" as I have been representing my own publishing company for over a decade. I have won lawsuits before pro se while suing other companies. I am not a lawyer, but I (and everybody else) can represent themselves in the US.

38faktorovich
mayo 4, 2022, 9:10 pm

>30 Petroglyph: "I'm asking for... a table summarizing the styles". I stated as much and explained in my previous post why such a summary table would not serve readers any better than looking at the full complex set of outputs in the complete data tables, which you can use to see the ranges that different ghostwriters fall between. You are again not reading any of my replies, and just repeating the questions, as if I did not already answer them.

Creating a simplification of my research with a table that compresses my intricate findings into only the highlights of those findings would not be the "crowning achievement", but rather would give you and others room to point out that the summary does not include the intricacies I would have to delete to create this abridgement.

The degree of collaboration (percentage of text written by each and if it is written by 2 or more ghostwriters) changes between most of the texts, so it would not be a matter of simply adding a row to quantify the range in a collaboration; in contrast, the tables I have already posted on GitHub provide this full range with all of the intricacies to allow researchers to explore these patterns more closely if this is a subject that interests them.

There is no simple attribution method, as even if there is a program to process texts, somebody has to research the implications of the quantitative results. The only way to simplify my method for readers is to create a program that applies the 27-tests automatically to any texts entered into the system (probably starting with the Renaissance or other periods I already have a completed corpus for). This is not something I have considered doing, as my method has been sufficient for me to solve the mysteries I am interested in researching.

39faktorovich
mayo 4, 2022, 9:15 pm

>31 Petroglyph: That is absolutely not what my statement means. I am describing the statistical distribution curve of output values that has to change slightly (if the corpus is large) or greatly (if the corpus is relatively small) every time a new text is added to a corpus. I did not add texts that are "like" any of the ghostwriters to "expand the style". I do not even have any idea what you are trying to say here. I added texts that were needed to verify alternative potential ghostwriter-bylines, or when texts were essential to establish the earliest and last texts a ghostwriter created, or otherwise were necessary for BRRAM (such as when they were discussed in previous computational-linguistic studies and I wanted to check these attributions with my own method).

40faktorovich
mayo 4, 2022, 9:17 pm

>32 Petroglyph: You have not read any of my replies that explain why all of the nonsensical accusations you are making against my method are false.

41faktorovich
mayo 4, 2022, 9:23 pm

>33 Petroglyph: "To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as −1, 1 or 0.0, 1.0." --https://www.sciencedirect.com/topics/computer-science/max-normalization

I normalize my data with my unconventional or new method. I compare each type of data units separately and then turn them into a binary (0/1) data table; this resulting data table is in a single unit 1/0, as is standard in the above quote where "0.0, 1.0" is an option.

42faktorovich
mayo 4, 2022, 9:26 pm

>34 Petroglyph: Yes, the top-6 words test checks the frequency of all words in the text to determine which are the most frequent words. The output this gives is a unique pattern of these words that frequently can identify the underlying ghostwriter on its own (in less cooperative corpuses than the Renaissance). I am not only testing the top words' frequency, but the combination of these words that a given ghostwriter uses that identifies their personality traits, when these are considered beyond their purely quantitative value for authorial attribution.

43faktorovich
mayo 4, 2022, 9:34 pm

>35 prosfilaes: If I only had a single test (such as eye-color), and all of the children were clones (or only had a father and no mother) and none of the children's eye colors matched any of the tested fathers; yes, this would have generated a glitch. Upon encountering such a glitch, I would have used a different testing method to determine parentage. This is why, instead, I used 27 different tests, considered thousands of potential authors/ "fathers", and otherwise maximized the degree of precision of the results. If my corpus had 104 authors for the 104 bylines in the corpus, the texts written by the same authors would have still been proximate to the other texts by that author, and would have been accurately attributed (as I found out when I tested this type of a multi-author corpus for the 18th century). I am not hiding any information; everything is on GitHub.

44lilithcat
mayo 4, 2022, 10:38 pm

>37 faktorovich:

I am the "qualified intellectual property attorney"

No, you are not. Your understanding of "fair use" is way off the mark. There are other factors than the amount of material published that are considered, and there is no bright line as to the percentage of material that may be used. To quote the U.S. Copyright Office, "there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission."

One can reproduce some of the work in order to criticize or comment on it - exactly what people are doing here. The fact that your work is published is a factor in favor of considering this fair use, and no one here is using the material for commercial purposes.

I (and everybody else) can represent themselves in the US

With the permission of the Court, which is not always granted.

45Crypto-Willobie
mayo 4, 2022, 10:51 pm

45!

46paradoxosalpha
Editado: mayo 5, 2022, 12:21 am

>36 faktorovich:

Really???

To suggest that reference to the "Holy Ghost" in claims for the divine inspiration of scripture, or "Thomas Nash his Ghost" whom the text shows to be a conventional post-mortem haunt, are either of them comparable to or the basis for what we refer to in contemporary English as "ghost writers" and "ghostwriting" is bonkers. "Insects are not animals"-level bonkers.

47Petroglyph
mayo 4, 2022, 11:34 pm

>38 faktorovich:

All I see are silly excuses for not providing the authorial signatures.

I stated as much and explained in my previous post why such a summary table would not serve readers any better than looking at the full complex set of outputs in the complete data tables

This is not true.

You are demanding that people wanting to check the "authorial-signatures" for themselves have to pour over your tables in detail, go through multiple steps and tables and put in a lot of time and work. This is an unreasonable expectation. Having the summarizing table available would be of great assistance. It would most definitely serve readers better than putting in unreasonable amounts of work.

Moreover: at the end of that unfamiliar process, there is no guarantee that they would have accomplished the goal successfully. You, however, are the only one familiar enough with your "newly-invented" and "unconventional" method to create this summary table with ease and, more importantly: without errors. You are the person who is in the ideal position to do so.

You have also not posted the actual text files that you have used -- I mean, your actual corpus. All the sonnets and plays and letters and all the rest. So readers who use differently-formatted texts from yours may get different results -- different file formats are just errors waiting to happen. You'll want to avoid that! The best way is to provide the table yourself.

as my method has been sufficient for me

And I am telling you that this laborious, painstaking method is not sufficient for other people. You need to keep in mind that other people are not as familiar with your methods, and may not want to invest the time needed to become fluent in them.

the summary does not include the intricacies I would have to delete to create this abridgement

Like I said: It need not be a perfect table that covers all the intricacies, and no rational person would expect that. All it needs to be is a clear overview of the six different authorial signatures and what values they consist of.

It should be straightforward enough to apply texts to them using your 27-test method, but of course allowance will be made for in-between cases.

If you like, you can add the titles of two or three texts that you think are mainly written by each person, so that the people using your test have a few illustrative examples to guide their own process.

You are again not reading any of my replies, and just repeating the questions, as if I did not already answer them

I am not ignoring your replies. I am stating that I find your answers and your objections insufficient. Your objections, to me, sound like you don't want to create such a table. But that is not a good enough reason. It is your job, your responsibility as an author, to make your results accessible to the readers in an easily-digestible format. Merely providing a link to multiple tables and expecting readers to figure it out for themselves by going through the entire process step by step is not good enough.

It is part of the author's responsibility that, when offering a new methodology, they should provide the tools to make things easier for others. The end result of all your labours should not be for readers to go through all the exact same steps as you have for several years. You, Faktorovich, who stand at the end of a long process, should be able to develop tools for others who want to employ your method. Tools that were not available to you at the start of this process, but that you, the only specialist in this method, are able to devise and provide. A summary table with a clear, unambiguous statement of the six authors' signatures is an ideal such tool.

For comparison: you imported an improperly formatted text in R once, and immediately gave up and claimed you had demonstrated that R introduces errors. Solutions were offered to you, but you had already decided that there was no point in using R. Do you want that to happen to you? You don't want people to give up at the first sign of trouble when confronted with an unfamiliar methodology. Which is an unfair reason to discard the whole enterprise.

Also, part of your schtick is that you use freely available tools and avoid complex statistics such as z-scores. Providing handy overviews fits right in with an orientation towards openness and applicability by laypeople.

But alright. I see that I have not been literal and explicit enough.

I want you to create this table because:

It would be the easiest, most straightforward way of applying your methodology to previously untested texts from the Renaissance for people who are not you, for laypeople.
It is unfair of an author to expect the reader to produce the summarizing tables that encapsulate the author's central claims. This is, properly speaking, part of your job and your responsibility in publishing a scholarly tome. Therefore, you, the author, should perform this work; it is unreasonable to expect it from the reader.
You are the only person who is capable of creating this table without errors. (Such is the burden of "newly-invented" and "unconventional" methods).
If the summarizing table I am asking for is insufficient to capture various degrees of collaboration, then so be it. Its value in discovering single-author works that were ghostwritten by someone from your secret workshop is enough. It need not be perfect or capture all the intricacies.
If you are truly serious about creating a methodology that is easy to use and involves nothing you need an advanced degree for, then you kinda have to provide some tools. Tools that were not available to you at the start but that you are now in a position to provide with the benefit of hindsight.

48Petroglyph
mayo 4, 2022, 11:36 pm

>42 faktorovich:

Yes, the top-6 words test checks the frequency of all words in the text to determine which are the most frequent words.

Ok, gotcha. Thanks for confirming that for me.

The output this gives is a unique pattern of these words that frequently can identify the underlying ghostwriter on its own

This is clearly apophenic nonsense. But we'll get to that later.

49Petroglyph
mayo 4, 2022, 11:37 pm

>40 faktorovich:

*shrugs* Well, I tried. Harder than most other people would have.

50Petroglyph
mayo 4, 2022, 11:39 pm

>41 faktorovich:

No, I meant, do you normalize your word frequencies in any way before you replace your data with ones and zeroes?

51Petroglyph
Editado: mayo 4, 2022, 11:59 pm

>24 faktorovich: By the way, you broke copyright law when you posted images of 5 pages from Volumes 1-2

As others have pointed out, fair use (in the US) covers my use of your pages for the purposes of criticism.

If you feel this deserves legal repercussions, have yourself call my people.

Alternatively, you can register a complaint with LibraryThing staff. The contact information and the evidence you need to provide are listed on this page, under the heading DMCA.

(also, small correction: I posted images of 4 pages in their entirety, a portion of a fifth page, and a tiny bit of a sixth.)

52Petroglyph
Editado: mayo 5, 2022, 1:43 am

To everyone who is not Faktorovich: Warning! Extremely tedious and nit-picking post coming up. Stay out! Go to >53 Petroglyph: instead!

>24 faktorovich: I did not make any mistakes of believing 2% of 10 million is 2,000; if I had you would have cited a specific quote where I said anything of this sort.

Ask, and you shall receive.

And in an experiment that considered a 10-million-word corpus, the 2,000 most frequent words (2% of all) comprised 80% of all words (8 million). (Faktorovich 2021, p. 474)

Here is a screenshot of that quote embedded in its context. Note: this screenshot is for the purposes of criticism.

There are five figures in this sentence. Four of these ("10 million, 2,000, 80% and 8 million") come from Faktorovich's source (O'Keeffe et al. 2007); and "2%", which is her own addition, an addition I believe to be in error.

I read all the figures in this sentence as referring to the same 10-million-word corpus. And the source that Faktorovich is relying on here for her numbers on that 10 million word corpus indeed relates its four figures to that corpus: "the first 2,000 or so word-forms do most of the work, accounting for more than 80% of all the words in spoken and written texts" (O'Keeffe et al. 2007, p. 32).

Furthermore, I read the sentence in that quote as a parallel construction that takes the form of a chiasmus. I have given what I believe to be the intended equivalents in each half the same formatting:

the 2000 most frequent words (2% of all)

vs.

80% of all words (8 million)

In my reading, the parallelism is there to set up a paradox: the small figures in the first half are revealed, in the second half, to have very large-figure effects.

The two figures in the second half both pertain to the 10-million-word corpus; both, in fact, refer to the same proportion: 8 million is indeed 80% of 10 million. Furthermore, the "8 million" in the second half is between brackets, and serves as an explicitation of the "80%" figure. So the most natural reading for the first half is to set up that parallelism, and so I interpret it along similar lines as the second half: the author is (mistakenly) saying that 2,000 and 2% are the same proportion out of ten million, and is using brackets to explicitate the figure of "2000" with respect to the 10-million word corpus.

Finally, the part that says "2% of all" -- I interpret its intended meaning to be "2% of all the words in the 10-million-word corpus", as a parallelism to that "80% of all" in the second half.

If this reading is incorrect, I would appreciate an explanation of where that figure of 2% came from and what its import is in the discussion of the 2000 most frequent word forms.

So, a typo? A genuine error?

I think this is a genuine error. I think the "2%" should have been omitted entirely -- that is the only way for this sentence to make sense.

After all, the 10-million-word corpus does not contain 10 million unique words; there'll be many, many tens of thousands of instances of the, of, is, in, to, I, he, she, from, ... so amending the "2%" to "0.02%" makes no sense -- 0.02% is clearly not the same as 80%. And you can't change the "2000" figure -- that's right there in the source.

The top 2000 word-forms (or "types") will, of course, be some proportion of all the word-forms in the frequency list. But that proportion is unlikely to be anywhere near 2%. O'Keeffe et al. don't give an exact number for the number of types in this corpus, but looking at the graph on that same page, it's above ten thousand; their discussion on pp. 48-49 suggests something between 16,000 and 20,000 (that is, the 10 million tokens in the corpus fall into ~16,000-20,000 types). So the percentage that Faktorovich should have added ought to have been "10%", at the very least.

I think Faktorovich added the 2% because she's confused between word-forms and the individual words in the corpus: the type the vs all the individual tokens of "the" in the corpus (the top 2000 types (or word-forms, as O'Keeffe et al. put it) account for 80% of the tokens in the corpus). Faktorovich shouldn't have added that "2%", but she did out of confusion -- just one more kink in the incredible muddle that is the section I highlighted in >17 Petroglyph:. But that is only my reading, and I'm happy to be corrected in this respect.

Faktorovich, Anna. 2021. The Re-Attribution of the British Renaissance Corpus. 1st ed. British Renaissance Re-Attribution and Modernization Series 1–2. Quanah: Anaphora Literary Press.

O’Keeffe, Anne & Michael McCarthy, Ronald Carter. 2007. From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.

53Petroglyph
mayo 5, 2022, 1:43 am

TL;DR of >52 Petroglyph:: Faktorovich is fractally wrong.

54faktorovich
mayo 5, 2022, 11:33 am

>44 lilithcat: You imagine that it would be legal for you to take a couple of chapters out of "Harry Potter" and publish them online? While there are plenty of pirating websites that sell texts without permission from the publisher. There are absolutely no cases where publishing 5 pages with the original design would be technically legal. Pirates only get away with such actions if the publishers do not want to invest in paying for a lawyer's hours to pursue a lawsuit over minor cases, reserving such prosecution for major financially damaging cases. The "U.S. Copyright Office" is not a lawyer, or a judge; so this statement from them is not at all legally binding. Yes, there are indeed "formulas" for what constitutes an infringement - this formula is the amount of words etc. per text that was previously found to be a violation in case law. So research case law for specific book piracy cases, and then cite specific instances where publishing 5 pages without scholarly engagement with all of the main points of this text was excused as not a violation of copyrights. In all cases I have won in court, I was given permission to represent myself, so these count as experience to make further self-representation in court more likely to be granted.

55faktorovich
mayo 5, 2022, 11:45 am

>46 paradoxosalpha: The "Thomas Nash his Ghost" is equivalent grammatically to Modern English's "Thomas Nash's Ghost". This is only one example of this type of wording from the Renaissance. As I explained in my original explanation in the previous thread #1, the Workshop regularly attributed texts to bylines for the first time after the "author" had died, as was the case with "Marlowe" and "Philip Sidney". In these cases, and in many similar circumstances, the prefaces/ title-pages etc. tend to refer to ghosts writing, or the spirit of the dead, or how they live on beyond their time of death through their writing. The Workshop was actively selling to wealthy people, like Mary Sidney, the glorification of their loved ones, like Philip Sidney (her brother), through the bylines of texts attributed to him, even if he did not write anything during his short life. And they were also using the bylines of the dead because texts by the deceased tended to sell better, as the public had sympathy over their tragic passing. The direct note "Thomas Nash his Ghost" simply restates this insider joke about "ghost-writing". I explain this point across BRRAM. The idea of a dead "author" starting to write after their death with help from a ghostwriter is exactly what this term is designed to mean. Yes, it can also be applied to ghostwriters writing for living "authors", who hire them for the service, or living people who are credited with scandalous text etc. There is no better term in the English language that describes this spectrum of authorship activities under other bylines.

56lilithcat
mayo 5, 2022, 11:58 am

>54 faktorovich:

Yes, there are indeed "formulas" for what constitutes an infringement - this formula is the amount of words etc. per text that was previously found to be a violation in case law.

This is simply not true. You seem to think that the only factor in determining "fair use" is the amount of "words etc. per text". But there are other factors which you are ignoring, and courts consider all the factors in determining whether fair use applies.

57Keeline
mayo 5, 2022, 12:01 pm

>54 faktorovich:

It is true that there is no "bright line" or "magic formula" for the portion of a work that can be copied. When these cases do make it to court, it has less to do with the length of the material and more to do with the permissible uses granted under the Fair Use provision. There is a lengthy discussion on the Wikipedia page ( https://en.wikipedia.org/wiki/Fair_use ) with many case examples cited. For example:

The four factors of analysis for fair use set forth above derive from the opinion of Joseph Story in Folsom v. Marsh,5 in which the defendant had copied 353 pages from the plaintiff's 12-volume biography of George Washington in order to produce a separate two-volume work of his own.10 The court rejected the defendant's fair use defense with the following explanation:

A reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy ...

In short, we must often ... look to the nature and objects of the selections made, the quantity and value of the materials used, and the degree in which the use may prejudice the sale, or diminish the profits, or supersede the objects, of the original work.

There is not the slightest chance that any reasonable person would consider the less than half a dozen images shown to be a replacement for the published work.

One of the reasons that there is an old proverb that says

A Man Who Is His Own Lawyer Has A Fool for a Client

is that having another person who is expert in the law review a claim will inform when the case is unlikely to prevail under the statute and case law. If one is the only person reviewing a case, it is too easy to let emotions guide decisions.

Here is another resource on Fair Use by the Copyright Alliance ( https://copyrightalliance.org/faqs/what-is-fair-use/ ).

Critique and academic discussion are key permissible uses under Fair Use that fully apply to this thread.

James

58paradoxosalpha
Editado: mayo 5, 2022, 4:46 pm

>55 faktorovich: The "Thomas Nash his Ghost" is equivalent grammatically to Modern English's "Thomas Nash's Ghost".

Ho, look! There's a ramp to a disagreement no one was trying to have! Turn before it's too late!

Edited to add: In "Thomas Nash his Ghost," the deceased Nash is the putative writer in whose name the text is issued, the pretended byline as it were. That is the exact opposite of the uncredited "ghost" publishing to the credit of someone else's false authorship. If I were a political strategist who wrote an op-ed to be published over a local official's name, he wouldn't be the "ghost." It would be me.

Similar deficiencies obtain in your argument from "ghostly" inspiration of scripture.

59faktorovich
mayo 5, 2022, 1:02 pm

>47 Petroglyph: I have already created all of the "summary tables" that my studied needed to assist readers with understanding the findings. There are several different tables for the Renaissance that list different elements of the study that might be of use to future researchers. I explain what this data signifies and enhance it with additional research across BRRAM. There is nothing about what you are demanding I do to delete data out my tables to create short-answer cheat-sheets that would be at all useful to researchers who want to understand my exact findings. The texts I used are mostly digitized and available in the sources I list in the Bibliography. I cleaned them up using a standard method that most users should also use, and thus get very similar results. It would be unethical for me to re-publish hundreds of EEBO books on my own website (i.e.: copyrights).

Why are you assuming I have any invested interest in anybody using my method? It would indeed be a great benefit to them to use my method, as they would derive accurate attribution results. But there would be no benefit to me, especially if they did so without crediting my method, or books. And it is absolutely absurd to demand for a computational-linguistic method to be "easy". The process of establishing the true authorship of a pseudonymous text is as difficult of a challenge as solving a 400 year murder mystery. Anybody who thinks that an automatic answer will be generated without significant input of labor, should not be going into this field (which combines many skills, such as reading the texts in question and dissecting them beyond merely their linguistic content). Creating a program that makes this process seem easy by finding matches between isolated texts would be unethical, as establishing a similar/dissimilar answer is only a step in the process, whereas the more important step is checking for other potential similarities with other bylines etc. If you do not have the concentration to read over the entire methodology section in Volumes 1-2, and to closely review the provided tables on GitHub; you need to stop criticizing BRRAM, as you are repeatedly explaining that you have not even understood the basic steps of the method, nor analyzed the posted data. The data I have provided can be re-ordered in the Excel files to check what the ranges are for each of the ghostwriters; if you do not know how to do this, again, you need to stop complaining about a subject you lack the capacity to process. The data tables I have provided are extremely easy to use, as they include intricate information with simple headers that explain what information is contained in which section.

If you or anybody else told me that you want to apply my method, but you are struggling with its complexity, and want my help to make it easier. I would absolutely help you to make it easier. But that's not why you are posting this tirade, as your only goal is to discredit my method by proving it is too difficult for you (or anybody) to use. If you are actually trying to use my method, maybe begin by explaining what about my existing tables is confusing you, and what about my steps is taking too long. There should be a way to solve this problem for you, but it is not making a table with simple ranges.

I did follow instructions to adjust my employment of R/Stylo until I realized that the program is designed to be inaccessible and unusable to a general user, even if they read all of the available manuals (that do not explain the basic steps you guys instead had to give to me in this thread). In contrast, you have not told me about any glitches you have encountered with my method, as there are no such glitches.

The data tables are an overview of my findings. A summary of ranges for the ghostwriters is a simplification that only fits your request, and would only lead casual researchers to miss-quote my findings.

You are attempting to force your method onto me. It is not at all a part of my method to identify ghostwriters by any summary table of what I expect from their style. I attribute texts by putting them through the 27-tests, and I do not recommend to anybody to make attributions from a ranges table instead. It is just as wrong for somebody to engage in computational-linguistic authorial attribution without a PhD in English or computer science, as it would be for somebody to engage in a murder-investigation without being a licensed detective. One can stumble into the right answer, but one can also generate a false-positive, which can have detrimental consequences. Laypeople can use my method and they will generate accurate results, but interpreting their meaning will be challenging; it would be absolutely unethical for me to create a tool that would erode any degree of precision from this already simple approach to attribution.

60faktorovich
mayo 5, 2022, 1:06 pm

>50 Petroglyph: It would be a nonsensically duplicating step to first normalize the data (PS: not all of it is in frequencies) into 1/0 or another -1 to 1 system and then again normalize it by replacing it with 1s and 0s.

61faktorovich
mayo 5, 2022, 1:16 pm

>51 Petroglyph: If I felt that your conduct deserved legal repercussions, I would have already proceeded with filing a lawsuit. Most such lawsuits require that the aggrieved party first tell the infringer that one believes one is being infringed on. When I inform you that you are committing libel or copyrights infringement across this discussion, I am not doing it for the joy of complaining, but because these are required steps in solving such disputes. For example, if I told you that it is an infringement to post 5-6 pages of my book, and you added to it by posting 20 pages; you would be escalating the problem, after being informed it is a problem. Your knowledge of your actions being wrong is a significant ingredient in such cases. It would not be of benefit to me to go through your "people", as posting my complaints in this public forum is a legally recognized delivery of information to you (the infringing party), and it also serves as a record because it is public and seen by many other people who can testify of seeing me share the information. There is absolutely nothing to be gained by complaining to LibraryThing, as, for example, the rep from LibraryThing posted insulting remarks about my research at the top of this "2" thread: "she's jumped to conclusions based on suspicious data generated by faulty methodology". I have previously complained to LibraryThing about harassing posts in this thread, and they responded that they do not feel you guys are doing anything wrong. This is the standard response from all content-"moderating" companies. I am not at all surprised by it. I am just explaining to you how it all works, in case you are not aware.

62coprime
mayo 5, 2022, 1:31 pm

>61 faktorovich:

the rep from LibraryThing posted insulting remarks about my research at the top of this "2" thread

amanda4242 is not a LibraryThing employee, merely the person who created a new thread in order to spare all our computers from trying to load the original thread. No one from LibraryThing has commented in this thread continuation. You can tell because when LibraryThing employees post there's a little "L" icon in front of their username.

63faktorovich
mayo 5, 2022, 1:54 pm

>52 Petroglyph: One of the reasons scholars do not post images of quotes, but rather transcribe the entire section that one is criticizing is because it allows them to re-read the section as they transcribe it to re-consider its meaning. So, to review my own intention in this isolated quote, I am not going to read over your rambling comments about 2%, and will instead rephrase what the rest of the paragraph in question stated. I begin by pointing out that it is nonsensical for a computational-linguist such as Jackson to claim that he is checking word frequencies for words “occurring 2-6 times in the canon” and those “occurring 2-10 times”. This statement duplicates of repeats the testing of words that occur 2-6 times within the second phrase, while adding a few of the more frequent 7-10 words. Then, I explain that in a corpus will millions of words, 40-75% can be "hapax legomena" or 1-time occurring words; larger texts lean towards 40% and shorter texts to 75%. The size of the text thus significantly impacts the word-frequency measure, and most of these computational-linguists chop up small fragments of text, such as 500-words, and thus have samples with around 75% of 1-time words. Then, I come to the sentence in question. I am citing Kennedy's book when I summarize his findings, which would make more sense if a reader read those before reading my summary. The "10-million words corpus" measure describes the total number of words in a corpus being tested (not in individual texts, and not the number of words when repeating words are only counted once). The "2,000 most frequent words" phrase clearly explains that it is referring to the "most frequent words" (or they are counted only once in this 2,000 measure, so that if "a" occurs 500 times, it would only add 1 word to this "2,000" count). I can see how a reader can be confused what "2% of all" and "80% of all words (8 million)" signifies, but there is enough information here for the reader to guess the 2% is not arrived at by a division of 2,000 by 10 million. 80% are indeed of 10 million. And 2% is a unique measure that Kennedy calculates in his book when he counts the number of words that are part of these 2,000 frequent words (when counting each of these separately). If there is something wrong with this 2% figure, it would be a good idea to complain to Kennedy. My paragraph is explaining the errors with previous computational-linguistic studies, so if you believe you have found an error in Kennedy's claim, you are only adding support for my argument. My point is in the following sentences, where I explain that excluding the 1-time occurring words and those that occur more than 10-times, Jackson is excluding the significant most-frequent words; and additionally, Kennedy's and other linguists claims that such words would make up 80%+75% or 155% of a text cannot be logical, so something must have gone wrong in their calculation of what percentage most and least frequent words make up. The confusion over the 2%/80% calculation is precisely what I am criticizing in this paragraph. I then conclude that the percentage of 1-occurence, or 10+ occurring words, or those in the middle becomes irrelevant, when one simply compares the most-frequent words, which establish a linguistic pattern similarly in smaller and larger text sizes.

64amanda4242
mayo 5, 2022, 1:55 pm

>61 faktorovich: As >62 coprime: says, I am not a LT employee. I'm merely the member who continued the previous thread.

My summary in >2 amanda4242: reads, "The author reacts poorly when everyone says she's jumped to conclusions based on suspicious data generated by faulty methodology." I am summarizing the opinions expressed by posters on the previous thread; that many of those posters expressed disbelief in your work is not an insult but a statement of fact.

65faktorovich
mayo 5, 2022, 1:58 pm

>56 lilithcat: There are indeed many factors in a "fair use" dispute. Neither you in your post, nor I in my post named all possible factors that can impact a "fair use" judgement. I have pointed to the essential points you should all be aware of because they have been specifically violated in this discussion. And courts cannot logically consider "all the factors" in anything; they can only consider the relevant factors. There is actually an objection for "irrelevance" that can be made in court.

66faktorovich
mayo 5, 2022, 2:06 pm

>57 Keeline: The actual case you cite "Folsom v. Marsh", if it (or the section you cite) was used as the sole case to base the decision on, would have decided a case about the use of the 6 pages in my favor. "The court rejected the defendant's fair use defense". The explanation then clarifies that the duplication and publication of another's copyrighted materials at-length without engagement with all of it in scholarship is a violation. Similarly, posting 5/6 pages from my book, when only discussing that it is a long paragraph is a violation. The intent of posting such reproductions can be not to profit from the resale, but rather to decrease the number of people who would need to purchase my book, by making a section of it criticized as the worst part of the book publicly available for free. Publishing a pirated book to shrink profits for the author is just as much of a violation of copyrights, as selling the book or section of a book for direct profit.

67faktorovich
mayo 5, 2022, 2:08 pm

>62 coprime: Amanda conducted an interview with me on behalf of LibraryThing, saying it was for their newsletter. If she is not affiliated with LibraryThing, this is indeed a curious line of inquiry.

68FAMeulstee
mayo 5, 2022, 2:13 pm

>67 faktorovich: No she didn't, that was AbigailAdams26, look at the top of the previous thread!

69faktorovich
mayo 5, 2022, 2:15 pm

>64 amanda4242: Yes, this is why you believe it is acceptable for you to make such a post. But your summary and other posts are all extremely biased against me in a manner that is more of a direct attack than many of the other posters. For example, summarizing that "everyone" says my "suspicious data" is "generated by faulty methodology" includes the speaker, Amanda, in the statement, so you are expressing your own opinion as well as "everyone" else's. If you are going to insult a researcher in a public forum, there is no room to hide behind such publicly made comments, so you might as well acknowledge you are doing so with malicious bias. My data is fully accessible so it is not "suspicious". And my methodology works to generate precisely accurate attribution results, so it is not "faulty". Instead of providing any evidence to support such libelous accusations, "everyone" here has instead been just repeating the libelous statements under the assumption that if you repeat a lie enough times, it becomes the truth. But this is just never going to happen, when the truth is still the truth, and my data and method speak extensively for themselves.

70amanda4242
mayo 5, 2022, 2:17 pm

>67 faktorovich: I most certainly did not conduct the interview! AbigailAdams26 did and I clearly state in >1 amanda4242: that I was copying and pasting Abigail's introduction from the previous thread. Careful reading will save you from making such embarrassing mistakes.

71faktorovich
mayo 5, 2022, 2:19 pm

>68 FAMeulstee: Thanks for pointing this out that is helpful. Since I did not check to see the name of the person who had interviewed me, other readers are likely to have assumed Amanda was a LibraryThing editor, and that's why she took it upon herself to police the previous thread, push everybody off it and push everybody into this new thread. It helps to have this explanation at the top of the thread for anybody who might reach a similar misunderstanding.

72coprime
mayo 5, 2022, 2:19 pm

>67 faktorovich:

Amanda conducted an interview with me on behalf of LibraryThing, saying it was for their newsletter. If she is not affiliated with LibraryThing, this is indeed a curious line of inquiry.

There are two people involved here. One is AbigailAdams26 who is a LibraryThing employee (again, you can tell because there is an "L" icon beside her username). AbigailAdams26 is the person who started the original thread and the person who interviewed you.

The other person is amanda4242. amanda4242 is the person who continued this thread in a second thread because the original took a while to load due to how long it had gotten. It is common LibraryThing user etiquette for someone to start a new thread when the original gets long; anyone can do this. Doing this is not reserved for LibraryThing employees and is frequently done by people who are not LibraryThing employees.

amanda4242 a different person from AbigailAdams26.

73paradoxosalpha
Editado: mayo 5, 2022, 4:05 pm

>68 FAMeulstee:

All those "A"-initialed people look the same.

>71 faktorovich: other readers are likely to have assumed

Not everybody has the chip on your shoulder or your paranoid ideation.

74FAMeulstee
mayo 5, 2022, 2:29 pm

>71 faktorovich: No she didn't, it is usual. Anyone can continue a thread when it becomes way too long.
No one else but you would assume she is a LibraryThing employee. As you could see at the top of the previous thread, there is a "L-icon" next to the name of LibraryThing employees.

>73 paradoxosalpha: LOL, my first name starts also with an "A", I might look identical ;-)

75Keeline
mayo 5, 2022, 3:11 pm

>66 faktorovich: Kindly read it again and note the basis for the rejection:

defendant had copied 353 pages

and the court's decision:

A reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy ...

Are you really trying to claim that those 5 or so pages represents the essential part of your 17+ volume series?

This is not the same case since the usage here is to analyze and critique an aspect of your presentation and content. This is how academic reviews work in most cases. Some diagram or text is presented and the critique responds to it.

In a fair world, lawsuits would be the complete responsibility of the losing party, including any attorney fees to defend against the frivolous claims. Sometimes this is true. Often it is not.

In reality, most copyright and other IP infringement claims are settled out of court, making the circumstances of any given case harder to use as a precedent for future cases.

James

76amanda4242
mayo 5, 2022, 3:19 pm

I just want to make a few things very, very clear:

I am not now, nor have I ever been, a LibraryThing employee.

I have never claimed to be a LT employee.

I have never conducted an interview on behalf of LT.

I have never claimed I have conducted an interview on behalf of LT.

I made it clear in >1 amanda4242: that I was copying and pasting AbigailAdams26 introductory posts by making the very first sentence of >1 amanda4242: "Copied and pasted from AbigailAdams26 introductory post https://www.librarything.com/topic/337240#7670449."

All LT members have the ability to continue a thread once it surpasses 150 posts. All it takes to continue a thread is the willingness to click the "Continue this topic in another topic" link that appears once a thread has gone over 150 posts.

As I stated in post #1148, the previous thread was taking far too long to load so I continued it as a courtesy to other members.

I did not continue the thread to try to halt the discussion, but rather to make it easier for it to continue. I even explained in post #1153 how to link to posts in the previous thread so members could easily reference the previous thread.

Many people expressed unflattering opinions of faktorovich's study in the previous thread. To summarize part of the thread by saying "everyone says she's jumped to conclusions based on suspicious data generated by faulty methodology" is not an insult, but a statement of fact.

That I also have an unflattering opinion of faktorovich's work does not make the summary any less true.

I am not AbigailAdams26.

77Petroglyph
mayo 5, 2022, 3:34 pm

I'll just leave this here; it seems apposite.

Faktorovich, page 1 msg #1050: "I have not made a single false statement in this thread, and I don't believe I have ever made a false statement in my life."

78Keeline
mayo 5, 2022, 3:35 pm

>36 faktorovich: Yes, I think more clarification is called for. Do you have other examples?

This seems like another example of "reading between the lines" (or "subtext") to go from a literal meaning to the figurative meaning that you wish to apply.

I don't see "ghost writer" with anything or nothing between the words in your example.

The usage in some religions of "the Holy Ghost" has nothing at all to do with writing or ghostwriting. It would be expected to find "ghost" used in a Bible or similar religious volume.

Perhaps this blog-like article is not authoritative. But the author claims that "ghostwriter" was coined in 1921 ( https://www.thewritersforhire.com/whats-the-origin-of-the-term-ghostwriter-and-h... ).

It seems like it would be earlier and certainly the concept is much older, just not the word. Stratemeyer always referred to his people as "writers" not "ghostwriters". I have looked at hundreds (perhaps thousands) of his letters from the first 25 years of his Syndicate. When a "ghost" is mentioned, it is a supernatural spirit or at least the representation of one since "ghosts are not real" is a tenet of these books (unlike the recent CW Nancy Drew TV show). Here's one example that Stratemeyer personally wrote.

https://stratemeyer.org/edward-stratemeyer-author-and-book-packager/dime-novels-...

{The original story in a newspaper did not have an illustration. A period illustration was added which fits the text.}

The extra term is needed to clarify a distinction between the owner of a copyrighted work and one who may have contributed to the content of the work such as writing some or all of it. For copyright, the "author" is the "owner," even when there is a clear transfer of title for the work (product and labor).

James

79anglemark
mayo 5, 2022, 3:40 pm

Is it just me, or does anyone else see it as a bad sign that the owner of a small publishing house doesn't know relatively basic things about copyright?

Anyway, >29 Petroglyph: – my pleasure, and I have to say that as frustrating it is to read even a single post by Faktorovich, it is counterbalanced by the posts by your good self and everyone else who does know what they are talking about. In fact, I'm becoming more and more inspired to get back to doing a bit of research, myself, and that can't be a bad thing :-)

-Linnéa

80Petroglyph
mayo 5, 2022, 3:41 pm

>61 faktorovich:
K.

81Petroglyph
mayo 5, 2022, 3:46 pm

>63 faktorovich:
I am not going to read over your rambling comments about 2%, and will instead rephrase what the rest of the paragraph in question stated

Yeah: your usual modus operandi.

In brief: you pulled the 2% figure from yet another source, and felt it was appropriate to add this proportional figure into a completely different context and apply this proportional figure to some unrelated numbers. Alright.

I was wrong. You're not just confused about word types and word tokens; you're even more confused. There's always a deeper fractal.

82Petroglyph
Editado: mayo 5, 2022, 3:59 pm

>59 faktorovich:

Well, it is my suspicion that the only way to achieve your results is to repeat all your steps one by one. Sounds like you agree with me.

I'll stop prodding you to provide a clear, unambiguous, non-overlapping set of criteria that demarkate your ghostwriters from one another. Pity. It would be such a revealing table.

83anglemark
mayo 5, 2022, 3:59 pm

>78 Keeline: The earliest attestation of "ghost writer" in the OED is from 1908, from a local newspaper in Lincoln (Nebraska): "There are a dozen or so known, catalogued and labeled ‘ghost’ writers in town. Few of them have ever seen their names in print."

"Ghost-written" is first attested in Arthur Ransome's Bohemia in London, published in 1907. "There are agents who make a living by supplying ghost-written books to publishers who keep up for appearance sake the pretence of not being in the know."

The very earliest use of "ghost" meaning "ghostwriter that the OED mentions is from as early as 1881 (so still more than 200 years after the end of the English Renaissance) : "They have never been introduced to a ‘ghost’—that is to say, a person employed by incompetent artists secretly to do up their work and make it artistic." Vanity Fair 24 Sept 1881.

-Linnéa

84anglemark
Editado: mayo 5, 2022, 4:07 pm

>60 faktorovich: That's.... not what normalizing is.

-Linnéa

85Petroglyph
mayo 5, 2022, 4:09 pm

>73 paradoxosalpha: All those "A"-initialed people look the same.

"Well if they didn't want me to confuse them, they shouldn't have put numbers in their name."

86Petroglyph
mayo 5, 2022, 4:11 pm

>79 anglemark:

Cheers!

The fun thing about doing more research is that you can do it pretty much any time. Sadly, that is also one of the drawbacks: you can't just leave it at work and stop thinking about it. But if this whole debacle inspires you to do research, it'll have been worth it!

87lorax
mayo 5, 2022, 4:21 pm

faktorovich (#71):

I know you have said that you never make mistakes, but this is one case where saying "oops, got the names mixed up, sorry" would really be to your advantage. Nobody else thought amanda4242 and AbigailAdams26 are the same person just because their names both start with the same letter, especially since amanda4242 doesn't have the characteristic "L" icon indicating an LT employee. (It's okay that you didn't notice it, especially if prior to this thread you weren't active on Talk much, but don't assume everyone else made the same error.)

88lilithcat
Editado: mayo 5, 2022, 4:25 pm

>78 Keeline:

Oh, never mind. Next time I'll read all the posts . . .

But the author claims that "ghostwriter" was coined in 1921

The OED gives an earlier use, citing a July 23, 1908, article in the Lincoln, Nebraska, Daily Star: "There are a dozen or so known, catalogued and labeled ‘ghost’ writers in town. Few of them have ever seen their names in print."

89Keeline
mayo 5, 2022, 4:44 pm

>83 anglemark: That seems closer to what I would imagine for public usage of the phrase. As I wrote, the 1921 seemed hard to believe, particularly with the extensive use of the process in the 19thC by story paper and dime novel publishers. It might go back earlier than 1881.

Often there is an understanding of a writer adopting a personal nom de plume ("Mark Twain" = Samuel L. Clemens, etc.) but "house names" are far less understood, even by many today. Library cataloging systems, including LT, don't have good ways to handle the published name vs. the person or people who worked on it for even a single work. But that is a topic for another thread.

Curious, I looked at my notes and transcriptions of Syndicate correspondence with 4,372 outgoing letters and 2,656 incoming letters. I have page images for tens of thousands of pages but a good portion of them are not relevant for my work. I don't note and transcribe everything, mainly the interesting ones.

Edward Stratemeyer died on May 10, 1930. As I wrote, he called his people "writers" and "artists" (for those who made illustrations). The first use of "ghost writing" appeared in a 1943 letter to a publisher editor when she passed along a request of an autographed Nancy Drew book. Here are the first two paragraphs of the reply:

Following a precedent decided upon years ago among Mr. Stratemeyer, Mr. Alexander Grosset and Mr. Reed, none of the series books ever have been autographed. It is unfortunate that this word was not passed along to you when you became head of the juvenile department, and I hasten to explain the reasons to you, why it is inadvisable for our writers to give out autographs.

In the first place, the names on the series are pseudonyms, which are the property of the Syndicate. Most of the authors of the stories are engaged in some other type of work which makes it inexpedient for their identity to be revealed, and in a few case in the past where a writer has died, the series has been carried on by someone else. Furthermore, as you know, our Syndicate retains all rights to these series, and the "ghost" writers have no further claim on them, once the manuscripts are accepted by us. You can readily see why it has seemed best not only to keep these matters a business secret among Grosset and Dunlap, our writers and ourselves, but also why it is not feasible to have signatures put in our books.

I would note that the Syndicate did sign some books in later decades, particularly in the 1970s.

There are other references to ghost in this context in the 1940s and 1950s. By this point the conventional usage of "ghostwriter" was being used by the principals in the Stratemeyer Syndicate.

In this field there are many terms that collectors have coined to describe features or processes connected with series books. For example, the practice of issuing three or perhaps more books at one time to get a series launched are called "a breeder set" or "breeders" by collectors. But you don't find this term in the contracts and correspondence of the period for the Syndicate. The earliest use I have found for series books is in a 1976 memoir by Leslie McFarlane and this is probably where the collectors read it and started to use it.

The collectors refer to blue spine Hardy Boys and yellow spine Nancy Drew books as "pictorial covers" (1962+ for most titles) but the trade referred to these as "art boards" in the correspondence. Other terms were used by book manufacturers in the trade publications. Similarly, a "next-title announcement" used by collectors is a "throw-ahead notice" by people working on the books.

There are many more examples where the terminology after the fact is different than what the people doing the work would use. That's why I was really surprised by the claim that "ghost writing" in ay variant was used in the 1500s. I shall await an example I can actually see.

James

90Petroglyph
Editado: mayo 5, 2022, 5:48 pm

>60 faktorovich:

It would be a nonsensically duplicating step to first normalize the data (PS: not all of it is in frequencies) into 1/0 or another -1 to 1 system and then again normalize it by replacing it with 1s and 0s.

Ok. If that is your stated opinion.

So please explain to me why you normalize the frequencies of the punctuation marks you include in your tables before you do your ones/zeroes thing to these figures.

Edit: and please explain why you also normalize the parts of speech counts. (But to a different baseline than the punctuation marks.)

91Petroglyph
mayo 5, 2022, 4:55 pm

>83 anglemark:
>89 Keeline:

It's very common for phrases to be in use for a while before they are put down in writing. The OED staff are very proficient in tracking down very early (if not the earliest) recorded attestations, so missing it by like four hundred years is not something I would expect from them.

92Keeline
mayo 5, 2022, 5:30 pm

>91 Petroglyph: In the case of the Lincoln, Nebraska newspaper, when I read it (from NewspaperArchive.com) it sounds like a reprint from another publication like one of the New York City newspapers (often without credit). This was a very common occurrence. Here is the article.

James

93Petroglyph
mayo 5, 2022, 5:58 pm

>92 Keeline:
Thanks for tracking down that article!

Archival research always turns up these whiplash-inducing articles -- I thought that "the ermine -- née tiger skin -- rug" was funny. And then I got to the laundry list bit.

94faktorovich
mayo 5, 2022, 9:03 pm

>75 Keeline: There are no non-essential pages in my 17+ volume series, and it would be absurd to consider these pages as a percentage of the full series and not at most only of Volumes 1-2 (as this is a separate book with its own ISBN number). After I informed Petroglyph that I viewed his posting of designed pages from my book as a plagiarism, he posted at least one other different page deliberately to harass me (after having learned this activity troubles me). This would prove malicious intent, as opposed to an accidental or an unknowing piracy of copyrighted materials. He did not discuss the content of the pages in question, so he did not comply with the "fair use" allowance. "Academic reviews" are when somebody engages with the content they are borrowing to make an argument of their own with help from this content. But failing to review any details or to summarize the material in question beyond just saying it's a long paragraph; Petroglyph did not meet the minimum requirements for "fair" scholarly "use". I have never filed a "frivolous" lawsuit. I only file when there is overwhelming evidence that I have been wronged. Since Petroglyph cannot even comply with my request to stop posting screenshots and to only post transcribed quotes of no more than a paragraph and with explanations of the content; it is unlikely he would ever "settle" any case.

95lilithcat
mayo 5, 2022, 9:11 pm

>94 faktorovich:

I viewed his posting of designed pages from my book as a plagiarism

How could this possibly be plagiarism? The last thing Petroglyph did was claim that he wrote them! (Or do you have a different definition of "plagiarism" than the rest of the world?)

96faktorovich
mayo 5, 2022, 9:29 pm

>78 Keeline: You are all intentionally blind to every one of my actual arguments, as you keep preferring to make up what you imagine I am arguing instead.

In this case you have missed the joke the Workshop was employing. Yes, I know the term "Holy Ghost", and the Workshop subverted this term (across the century when they ghostwrote) to apply it to this Holy Ghost writing or the Ghost of the Deceased writing posthumously with help from a "Ghost-Writer". This joke is not only lost on you, but it was missed by scholars of this period for a few hundred years.

Before 1921, the term was revived as early as 1889 to describe "Miss Carrol" as a "ghostwriter": https://www.google.com/books/edition/Proceedings_of_the_Botanical_Section_in/WGe...

In 1895, Hawthorne is described lurking like a "ghost writer" of short stories: https://www.google.com/books/edition/Harper_s_Round_Table/xFQ_AQAAMAAJ?hl=en&amp...

A man describes his profession as a "ghostwriter" in court proceedings in 1896: https://www.google.com/books/edition/Z%C3%A9linde_comedie/srgvNbywffEC?hl=en&amp...

It appeared in a non-fiction story called "The Ghost Writer and His Story" by Graves Clark (1919), where it was applied to a hack writer of ghost stories: https://www.google.com/books/edition/The_Editor/M58uAQAAIAAJ?hl=en&gbpv=1&am...

The term "'ghost' writer" is also used in Publishers Weekly (1916): https://www.google.com/books/edition/The_Publishers_Weekly/hygzAQAAMAAJ?hl=en&am...

As you point out, the only reference to a "writer" I found on the page you liked to is: "Founded Syndicate in 1905 to hire ghostwriters to write books from his outlines." There is no hint of his application of the term "ghostwriter" in the books he had ghostwritten for him.

Yes, an "author" is generally the copyright holder, or the byline that is used in obtaining copyrights. However, if a work is pirated, the stated "author" might not be the legal owner of copyrights. And there are several other complex layers of meaning to the act of writing a text without receiving byline-credit for it, and all of these are best summarized with the term "ghostwriter". You have not suggested an alternative with equally multifaceted applications.

97faktorovich
mayo 5, 2022, 9:31 pm

>81 Petroglyph: In brief, I used the 2% figure to demonstrate how a couple of previous computational-linguists used misleading and confusing statistical figures. You only read the confusing figure portion of the paragraph, realized that it was confusing, and thus leaped to the conclusion that I was the one who had created confusing data.

98faktorovich
mayo 5, 2022, 9:34 pm

>82 Petroglyph: Yes, I designed a method with very specific steps that are all absolutely necessary to arrive at an accurate attribution. There is no shortcut without designing a program to execute all of these steps for the user, which would become progressively harder to build the more steps you want it to do for you (such as cleaning a file of glitches in the transcription, or checking the birth and death years of all of the bylines from the relevant period in the relevant genre(s)).

99lilithcat
mayo 5, 2022, 9:41 pm

>96 faktorovich:

A man describes his profession as a "ghostwriter" in court proceedings in 1896: https://www.google.com/books/edition/Z%C3%A9linde_comedie/srgvNbywffEC?hl=en&amp....

Wrong. That case is from 1940: https://casetext.com/case/cassidy-v-gannett-co-inc

100Petroglyph
Editado: mayo 5, 2022, 9:50 pm

>94 faktorovich: After I informed Petroglyph that I viewed his posting of designed pages from my book as a plagiarism, he posted at least one other different page

Are you referring to the page in >52 Petroglyph:? I already posted that page in >17 Petroglyph:. That screenshot was one of the things you were complaining about.

In >52 Petroglyph:, I explicitly said "Note: this screenshot is for the purposes of criticism." Like, right next to it.

But I guess that "must complain about something" and "I IS VICTIM" being decoupled from actual facts is a persistent trend with you, so I'm not surprised. Lies, Faktorovich. Lies.

101Petroglyph
mayo 5, 2022, 9:48 pm

>97 faktorovich:

K.

102Petroglyph
mayo 5, 2022, 9:50 pm

>98 faktorovich:

Your method only works if you go through all the steps and take on board all the confirmation bias, apophenia, creative counting via red-colour method, irrelevant tests, data-obliteration via 1/0, and all the other Greatest Hits, and if you have a sufficiently gradual creep in what the six ghost-writers' styles represent so they can be widened in small steps and accommodate an ever-increasing body of work. I suspect that, if you were to put actual ranges (of % passive voice, clout, adjectives, lexical density and all the others) to your six authors, that you'd end up with a set of data rows that would not unambiguously distinguish between the six.

Prove me wrong. Show me that table with six distinct authorial signatures as "evidenced" by your data.

103faktorovich
mayo 5, 2022, 9:51 pm

>89 Keeline: "Nom de plum", or pen name, is the same as a pseudonym, but none of these have been changed into verbs such as "ghostwrote", so they are not suitable for broad application in a book about a Workshop of Ghostwriters. It would not even be grammatical to say a Pseudonym(s?) Workshop. I explained already that the Workshop used this term in the "Nashe"-byline and cited several other examples of its application. The Workshop started many ideas that appear "new" to scholars today; for example, Verstegan's "Restitution" is credited by some with starting the "Saxon"/"Anglo-Saxon" origins (that morphed eventually into the Aryan concept) concept, whereas other scholars credit much later scholars. Before Verstegan, "Saxon" mostly referred to the people of the Duchy of Saxony. There are many other points, I raise across BRRAM, and you guys seem to be glued to "ghostwriter", without even at least reading my previous explanations why this was indeed the Workshop's humorous invention.

104Stevil2001
mayo 5, 2022, 9:55 pm

>59 faktorovich: "And it is absolutely absurd to demand for a computational-linguistic method to be 'easy'."

"I did follow instructions to adjust my employment of R/Stylo until I realized that the program is designed to be inaccessible and unusable to a general user"

You couldn't make this up!

105faktorovich
mayo 5, 2022, 9:55 pm

>90 Petroglyph: I do not "normalize" any of the data before turning it into a binary and thereby normalizing it. I enter texts into automated counting systems that return a frequency of punctuation marks or parts of speech, etc. The parts-of-speech numbers are given as a percentage of all word-types in the text. There is nothing "normalized" about a number like "10.44 %" of verbs, as this is not a number between -1 and 1. You have no idea what you are yelling about, so figure out what you are trying to say, or stop yelling.

106faktorovich
mayo 5, 2022, 9:59 pm

>95 lilithcat: You are right, it is piracy. This typo is a testament of how emotionally impacted I am by somebody republishing or pirating my work without permission, and then attacking me for pointing out that this action deeply troubles me, as I find it to be technically illegal.

107Petroglyph
Editado: mayo 5, 2022, 10:03 pm

>96 faktorovich:

For "Miss Carrol" I get "no preview available", so I can't check that.

Hawthorne as a "ghost writer" is an artefact of the text being in two columns. The actual sentence, at the bottom of the left-hand column, is "lurked like a ghost among the old church-yards".

108faktorovich
mayo 5, 2022, 10:04 pm

>99 lilithcat: Wow. You guys are seriously committed to finding mistakes in every word I write. I posted 5 examples that Google Books judged to have been published prior to 1921, to correct one of your claims about the first application of this term, and you looked up all of them to find the one that was incorrectly labeled by Google with the wrong year of publication? You should contact Google Books to file a complaint. The other examples are still valid cases of earlier applications of this term that whoever claimed the earliest usage was in 1921 was not aware of. We have been able to find in this thread usages as early as 1881 and 1896, and yet these are not mentioned in most definitions of the term "ghostwriter", so it is entirely likely that the Renaissance Workshop used variants of the term "ghost-writer" without these registering in any dictionaries or discussions about this term.

109faktorovich
mayo 5, 2022, 10:07 pm

>100 Petroglyph: Including a note that you are pirating a text in "fair use" does not counter the act of piracy, but rather stresses that you are aware your action is a violation and that you are attempting to get away with it by using an empty descriptor.

110faktorovich
mayo 5, 2022, 10:10 pm

>102 Petroglyph: You have previously stated that you have stopped repeating that I should create a nonsensical summary table that simplifies my data, and now you have proven your previous statement to have been a lie by again repeating this nonsensical request. There is no "data-obliteration" by applying the binary division, as it achieves the exact "normalization" you were just insisting on a few posts ago.

111Keeline
mayo 5, 2022, 10:12 pm

>94 faktorovich:

I viewed his posting of designed pages from my book as a plagiarism

I would note that plagiarism is not illegal; it is merely a violation of some codes of ethics at universities and some professions. It is defined as passing another person's work off as your own.

Perhaps you are conflating this with copyright infringement. However, if you are going to represent your company in legal matters, you should learn the difference.

James

112faktorovich
mayo 5, 2022, 10:13 pm

>107 Petroglyph: You're right, that's 2 out of six. How about you explain the other 4. I challenge you to find something wrong with all of them. I just briefly found some curious examples of "ghostwriter" usages in response to your request. I have never reviewed the application of this term in the 19th or 20th century before, as these earlier usages are irrelevant, when the term I am borrowing is from the Renaissance.

113paradoxosalpha
mayo 5, 2022, 10:19 pm

>96 faktorovich: In 1895, Hawthorne is described lurking like a "ghost writer" of short stories: https://www.google.com/books/edition/Harper_s_Round_Table/xFQ_AQAAMAAJ?hl=en&amp....
It appeared in a non-fiction story called "The Ghost Writer and His Story" by Graves Clark (1919), where it was applied to a hack writer of ghost stories: https://www.google.com/books/edition/The_Editor/M58uAQAAIAAJ?hl=en&gbpv=1&am....
In 1895, Hawthorne is described lurking like a "ghost writer" of short stories: https://www.google.com/books/edition/Harper_s_Round_Table/xFQ_AQAAMAAJ?hl=en&amp....
The term "'ghost' writer" is also used in Publishers Weekly (1916): https://www.google.com/books/edition/The_Publishers_Weekly/hygzAQAAMAAJ?hl=en&am....

In all three of these, the terms are expressly used to denote writers on the topic of supernatural ghosts or spectral entities. None of them have anything whatsoever to do with feigned authorial attributions.

114faktorovich
mayo 5, 2022, 10:23 pm

>111 Keeline: Being so harassed by a barrage of insults that I make a typo and confuse "plagiarism" with "piracy" is proof of emotional distress. I would not make this mistake when calmly designing a court filing. The terms are related: "plagiarism can warrant legal action if it infringes upon the original author’s copyright, patent, or trademark." While Petroglyph did not put his name on the pages he pirated, he did not put my name on it either, as he did not cite the page numbers, or chapters from which he took these pages, and he did not even include a footnote that specifically identified these pages as coming from my book. Only at the bottom of his post did he cite the volume in general. So, I was not incorrect to call it "plagiarism"; for example, synonyms of "piracy" include "plagiarism": https://www.thesaurus.com/browse/piracy There is really no good term for stealing a book equivalent to "plagiarism", as copying, infringement, piracy etc. do not really convey the precise intended meaning. "Plagiarism" can be unintentional: https://www.ox.ac.uk/students/academic/guidance/skills/plagiarism So by failing to include my name while posting my work, this can be an act of unintentional plagiarism, including if most readers would not assume that you were attempting to claim you wrote my book.

115faktorovich
mayo 5, 2022, 10:29 pm

>113 paradoxosalpha: The description of the author specifically as a "hack" writer before she is described as a writer of "ghost stories" humorously explains the title of the story is about hack-writing or ghost-writing and not merely about writing about ghosts. And the Publishers Weekly article similarly stresses that the "ghost writer's" work occasionally actually has "real artistic merit", a phrasing that is commonly applied to distinguish hack/ ghost writers from highbrow or true "writers".

116Petroglyph
Editado: mayo 5, 2022, 10:46 pm

>112 faktorovich:

>108 faktorovich: "I posted 5 examples that Google Books judged to have been published prior to 1921, to correct one of your claims about the first application of this term, and you looked up all of them to find the one that was incorrectly labeled by Google with the wrong year of publication?"

(implying it's wrong of us to do so. Taken out of context, this is something a legitimately upset person would say)

Also >112 faktorovich:: I challenge you to find something wrong with all of them"

(stating we should do so; taken out of context, this is something a defiant person would say)

Also >112 faktorovich: "I just briefly found some curious examples of "ghostwriter" usages in response to your request. I have never reviewed the application of this term in the 19th or 20th century before, as these earlier usages are irrelevant, when the term I am borrowing is from the Renaissance"

(she wasn't serious, you guys, it was just a quick search don't take it seriously). Back-pedaling time

>108 faktorovich: "The other examples are still valid cases of earlier applications of this term that whoever claimed the earliest usage was in 1921 was not aware of. We have been able to find in this thread usages as early as 1881 and 1896,"

(she's still right, and these claims were serious. Taken out of context: this is something a confident expert would say.)

117paradoxosalpha
Editado: mayo 5, 2022, 10:43 pm

>115 faktorovich: The description of the author specifically as a "hack" writer before she is described as a writer of "ghost stories" humorously explains the title of the story is about hack-writing or ghost-writing and not merely about writing about ghosts.

"Hack" is a pejorative label for a writer who produces in quantity and on demand. Although there is certainly an overlap between ghostwriters and hacks, there is nothing like an equivalence between the terms. Writers of supernatural stories for the popular market have been frequently derided as "hacks" when honestly using their own bylines.

118Petroglyph
Editado: mayo 5, 2022, 10:50 pm

>105 faktorovich:

A note about normalization.

For texts that do not have similar lengths, counting absolute frequencies and judging texts as "similar/not similar" based on those absolute frequencies leads to wrong results. Obviously. I shouldn't have to point this out. And Faktorovich indeed makes use of relativized frequencies (though not always!).

So you should at least relativize or even "normalize" your data -- transform absolute frequencies by relating them to a shared norm, say, recalculating them to the average frequency per 10,000 or 100,000 words, or per 100 or 1000 sentences. That way, when you compare many different texts of different lengths, all your frequencies for that particular test will have the same baseline. Maternal mortality and other medical causes of death (such as covid-19) are often expressed in incidences per 100,000. This is to accommodate both very frequent diseases and very rare conditions.

Expressing absolute frequencies in percentages is a form of relativizing them: you are recalculating them relative to a shared basis of "100". In scientific results, percentages are often expressed as a range between 0.0 and 1.0¹ (thanks, Faktorovich, for googling that for us in >41 faktorovich:). In corpus linguistics, textual features are often expressed per 100 sentences, per 100,000 words, or even per million. Here is a graph from O'Keeffe et al. (2007, p.14), a book I know Faktorovich has at least briefly glanced at:

The website Analyzemywriting.com, one of Faktorovich's online services, presents the punctuation marks both in absolute frequencies, and terms of per 100 sentences:

In her tables, Faktorovich uses these normalized results. Pace her comments in >60 faktorovich: "a nonsensically duplicating step" and >105 faktorovich:, "I do not "normalize" any of the data before turning it into a binary and thereby normalizing it"

That website also offers the option of normalizing punctuation per 1000 sentences, or even per 100/1000 words:

But that is an extra two clicks.

¹ Normalization between 0.0 and 1.0 would take the form of 0.01 for 1%, and 83% would be 0.83. It's not the 0/1 either/or choice that Faktorovich imagines it is.

119Petroglyph
mayo 5, 2022, 10:57 pm

>118 Petroglyph:

If I stop posting here after this, it's probably because I'll have been speedily deported to the US to stand trial for my ~~plagiarism~~ piracy of analyzemywriting.com's website.

120Keeline
Editado: mayo 5, 2022, 11:00 pm

>108 faktorovich: faktorovich: "I posted 5 examples that Google Books judged to have been published prior to 1921, to correct one of your claims about the first application of this term, and you looked up all of them to find the one that was incorrectly labeled by Google with the wrong year of publication?"

If you looked at any posts after #78, you would see the clarification that I and others made to find earlier examples. In #78 I expressed doubt about the 1921 claim. Stuff on the Internet is often wrong so each claim needs to be assessed to see if it is valid.

Perhaps this blog-like article is not authoritative. But the author claims that "ghostwriter" was coined in 1921 ( https://www.thewritersforhire.com/whats-the-origin-of-the-term-ghostwriter-and-h.... ).

It seems like it would be earlier and certainly the concept is much older, just not the word.

It was not my claim. It was the claim of the author of that page. They were wrong.

So the only partially successful attempt to show earlier examples of the word or concept is not especially necessary since it only joins the content of replies #83, #88, #91, and #92. Welcome to the party.

When someone claims that they never make an error, when the conversations show many examples of provable errors (not merely differences of opinion), how much confidence are we to have in the body of work being discussed here?

Now since this is starting to get emotional, rather than focusing on the theory, methods, and results of these extraordinary attribution claims, I suggest taking a break from the forum. It will still be here later on.

James

121prosfilaes
mayo 5, 2022, 11:05 pm

>96 faktorovich: the term was revived as early as 1889 to describe "Miss Carrol" as a "ghostwriter":

If you read the page, available at https://babel.hathitrust.org/cgi/pt?id=hvd.hx7rw7&view=1up&seq=208&q... , you'll see it mentions a book published in 1940. Obviously this is misdated by Google, which is not infrequent. It's irrelevant; whether ghostwriter was first used in 1889 or 1921, it's still a word created long after the British Renaissance, which makes your stretched interpretations of "ghost" in those works completely untenable.

122Petroglyph
mayo 5, 2022, 11:07 pm

Actually, that got me thinking.

Faktorovich uses analyzemywriting.com as a source for her data. Not occasionally, but systematically. She puts that in books and tries to sell them. Can she do that?

From their "terms" page, emphasis added:

You must not conduct any systematic or automated data collection activities on or in relation to this website without www.analyzemywriting.com's express written consent.
This includes:

scraping

data mining
data extraction

data harvesting

'framing' (iframes)

Article 'Spinning'

{...}
You must not use this website for any purposes related to marketing without the express written consent of www.analyzemywriting.com.

Faktorovich, do you have their express, prior, written consent to use their stuff in the books you sell on your website?

123lilithcat
mayo 5, 2022, 11:13 pm

>121 prosfilaes:

Yes, it looks as though Google digitized several different and unrelated items, yet catalogued them all under Proceedings of the botanical section in memory of Dr. Asa Gray (Brackets around "of the botanical section" removed by me because they created a weird touchstone.)

124Petroglyph
mayo 5, 2022, 11:17 pm

>120 Keeline:

Peace, Keeline. I appreciate your calm, rational, and knowledgeable contributions.

125indeedox
mayo 6, 2022, 1:52 am

I just would like to thank everybody posting in this thread!
I'm learning a lot.

126anglemark
mayo 6, 2022, 3:11 am

>125 indeedox: I'm learning a lot.

I think every single person reading this thread is. Well, with one exception.

-Johan

127andyl
mayo 6, 2022, 6:00 am

>107 Petroglyph:

You couldn't make it up. This is such a basic mistake, a mistake I wouldn't expect an 11 year old to make, it can only be an attempt at a deliberate misleading statement - despite faktorovich knowing that people here are more than willing to double check her statements.

128anglemark
mayo 6, 2022, 8:59 am

>127 andyl: No, I think it's just a result of searching GBooks really quickly for something that could "prove" her theory, and posting the search result without checking it. The autogenerated GBooks preview says

pastures and meadows , hills public , which had become familiar with Hawthorne as a and valleys and wild - pine groves , and lurked like a ghost writer of short stories , now saw that it had been entertaining a genius unawares .

You or I would have checked the page in the book to see what it really said, but that requires an extra click.

I don't think using the term "ghostwriter" is a problem in itself, because if you are talking about the group and their (hypothesized) function, you'll need a term for it. (The same thing goes for "byline", which to me is a much weirder choice.) What is ~~completely bonkers~~ more problematic about "ghostwriter" is the arsy-versy etymology.

-Linnéa

129andyl
mayo 6, 2022, 10:43 am

>128 anglemark:

Ahh OK, but she probably ought to have tracked the primary source - especially for something like ghost writer, and ghost, where the are multiple possible meanings.

Personally I would be amazed if there were not people performing the role of a 'ghostwriter' in renaissance England even though they didn't call it that. Which of course isn't expressing any sort of belief in Faktorovich's 'research'.

130faktorovich
mayo 6, 2022, 10:46 am

>118 Petroglyph: "For texts that do not have similar lengths, counting absolute frequencies and judging texts as "similar/not similar" based on those absolute frequencies leads to wrong results." As I argued in a previous post, short texts (under around 1,000 words) lead to less precise attribution results (with any testing method).

There is nothing unique about "absolute frequencies" that would make this measure uniquely useless when applied to short texts; it is as problematic or helpful as any of the other standard measures, as all units in a short text shrink at a similar rate (there are fewer punctuation marks and words overall).

I am now sure you do not understand what the terms "absolute frequency" and "relative frequency" mean, as what you are saying is nonsensical. All of the tests I apply that use frequency use relative frequency (such as the percentage of nouns, or commas per 100 sentences); it would be nonsensical if I recorded the absolute number of commas in a text when the text-length vary so widely. You argue I "should" use per-100 sentences measures, as if you have not read that I indeed use this measure (as I explain across BRRAM's Volumes 1-2). The website I use does not give the "option of normalizing punctuation per 1000 sentences"; it gives the option to create relative frequencies; refer back to the definition of "normalizing" I provided; the website in question refers to this as "per X" not "normalized" on a -1 to 1 scale. And whatever you are trying to say here is statistically nonsensical: "Normalization between 0.0 and 1.0 would take the form of 0.01 for 1%, and 83% would be 0.8." Just so you will understand the difference between "absolute" and "relative", here is a definition from Investopedia: "Absolute Frequency vs. Relative Frequency: Absolute frequency can be the starting point for a more nuanced statistical analysis. Relative frequency, for example, is derived from absolute frequency. When the absolute frequency of values is tracked over the entire trial, the absolute frequency for a particular value can then be divided by the total number of values for that variable throughout the trial to get the relative frequency. The relative frequency is what we most often reference, whether it is the winning percentage of our favorite sports team or the percentage of fund managers that beat the market. Unlike absolute frequency, relative frequency is usually expressed as a percentage or fraction rather than a whole number." https://www.investopedia.com/terms/a/absolute_frequency.asp

And all attribution methods judge "texts as 'similar/not similar'"; that's what attribution is determining which texts are similar to each other, and which are not similar or by different authors.

131paradoxosalpha
mayo 6, 2022, 10:56 am

>130 faktorovich: The website I use does not give the "option ...

I will be interested in your reaction to >122 Petroglyph:

132faktorovich
mayo 6, 2022, 11:52 am

>122 Petroglyph: I immediately sent the following message to AnalyzeMyWriting:

---Dear Administrator:

I have developed a new computational-linguistics author-attribution method that includes using your website (and other websites) to measure certain elements of texts, such as passive voice. Here is the series where I mention your tools: https://anaphoraliterary.com/attribution/ And here is a free copy of the series for your review... I cite your website in Volumes 1-2. I believe using and advertising your tools to other potential users is a positive thing for you. Let me know if you object to me mentioning your website, or if you might want to collaborate on developing a platform that combines a number of your tests into a tab on your page to assist other attribution researchers. Thanks in advance for your time. Sincerely,

Anna Faktorovich, PhD, Director, Anaphora Literary Press---

I did not send it because I agree with your interpretation of their rules, but rather just because they and you invited an email, and I felt it was the polite thing to do. As for the rules, there is nothing in their policy that bans free advertising from users who write about their tools being helpful to attributors. You again do not understand the terms they are using and leaping to the assumption they are relevant. They are forbidding activities that "causes, or may cause, damage to the website" or to other people. Such as "data mining" or collecting data about private users to sell it to corporations who want to sell them stuff, or even to criminals who want to target them for fraud. It would be absurd if a website like Analyze did not allow the usage of their tools for the main purpose it is created for, or for calculating linguistic measures in texts. Another term they use in those terms: "Data scraping is a technique where a computer program extracts data from human-readable output coming from another program." This is the covert extraction of data that sneaks into somebody else's system illegally and steal things that are not publicly accessible. The tools they provide are public and open to all, so there is no sneaking, or hacking involved in utilizing the tools in the manner they are designed to be used. When they say the website cannot be used in a manner related to "marketing", they mean in a way related to the other activities on this list like "data mining" or it cannot be used to phish or to harvest and process personal data to sell data about thousands of people.

133andyl
Editado: mayo 6, 2022, 1:16 pm

>132 faktorovich:

Another term they use in those terms: "Data scraping is a technique where a computer program extracts data from human-readable output coming from another program." This is the covert extraction of data that sneaks into somebody else's system illegally and steal things that are not publicly accessible.

The definition of data-scraping in quotes is correct.

However it is not 'covert extraction' it isn't about sneaking into somebody else's system illegally or stealing anything. In fact the whole point is your are collecting data in a form you can use elsewhere from a source that you can see as a member of public or your own logon. It is not automatically illegal in any jurisdiction (although some jurisdictions to have laws about scraping email addresses). It can be a contravention of a website's terms of use though.

134Keeline
mayo 6, 2022, 1:22 pm

When I was doing searches for terms like "ghost writer", "ghostwriter", "ghost writing", and "ghostwriting" in four newspaper databases to which I have subscriptions yesterday, I found some interesting false hits.

In some cases the nature of the search engine allows there to be words between the two I defined even though I used quotes. This becomes more like a

ghost near2 writer

search for the search engines that support that syntax that says something like "within two words". It is a little less helpful but the inaccurate nature of OCR and newspaper text layout means that this is sometimes necessary to catch most hits. They assume that a human will look at the result and see if it is relevant.

In some British newspapers of the mid-19thC had an often-reprinted ghost story where they would talk about the ghost and within a couple words mention "the writer" of the article/story. It clutters up the listings.

In the 1890s and first decade of the 20thC it is fairly common to see "ghost writing" (usually two words, sometimes hyphenated) to refer to collecting intelligence from the dead by entering a trance and writing what comes to the person with pen or pencil in hand. Spiritualism was popular in this era and there were some famous people who embraced it and others who worked hard to debunk it.

Decades later, after Albert Payson Terhune (December 21, 1872 – February 18, 1942) died, his second wife and widow produced a book that she claimed was a spiritual transmission of his thoughts. Today it is treated as some of his early writing with her notes to prepare it for publication.

In 1913 the president of the American League of baseball expressed concern about players who were going to be identified as writers of articles, particularly the key players of the teams participating in the World Series. This was, of course, ghostwriting in most or all cases. Some sports writer did the actual writing and the player was paid for the use of the name. Whether there were any interviews to collect the players' reflections is not noted. But the president of the league, Ban Johnson (almost like Ben Jonson), disliked the practice and tried to ban (sorry, an unintentional pun) the practice through the league rules and contracts.

Around this time, Christy Mathewson (August 12, 1880 – October 7, 1925) had published articles and some books between 1910 and 1919. A few names have been suggested for ghostwriting the various work, including two sports writers, Bozeman Bulger and John Nevill Wheeler, and one series book author for the brief series of baseball stories, Ralph Henry Barbour. The latter may be because Barbour wrote stories about several sports, including baseball, and had a football series for the same publisher at that time with similar titles. Yet, lists of Barbour's books in reference volumes which are generally reported by author do not include them. It is an area that may be a fun exploration of authorship attribution techniques once the texts can be gathered as well as some baseline examples of the writers. Barbour baselines are easy but the other writers may require more effort since they mainly wrote for newspapers. On one of these Christy Mathewson books had a special edition with a signature that did not match authenticated signatures by the player on contracts, letters, and baseballs.

But this all shows that it is not enough to look at a search results page. Each item has to be scrutinized. In some cases the date associated on Google Books or a newspaper database is wildly wrong and the date on the publication has to be found. For example, a periodical bound volume scanned by Google books may have a year that corresponds with the first year for the publication and not the first (or last) issue in the bound volume. The metadata in the newspaper databases can be wildly wrong as I discovered when looking for Disneyland articles. In general the usage should be no earlier than 1952 and mainly from 1954+ for the theme park. There were some other uses. But if you see something from the 1910s, or 19thC, then it is a case where the metadata is wrong. The date on the first page of the newspaper is generally reliable and should prevail unless there is one of the occasions when they made a typographical error (in the original sense of the term) in setting the banner for the paper for that issue. But usually this is not years off as the metadata problem reveals.

The close columns of newspaper and some magazine print can cause OCR systems to be confused to jump words across a column gutter and lead to confusing results.

It is an imperfect world.

James

135Petroglyph
mayo 6, 2022, 4:49 pm

>130 faktorovich:

When you quote me, you conveniently leave out this sentence: "And Faktorovich indeed makes use of relativized frequencies (though not always!)". And, later in >118 Petroglyph: "In her tables, Faktorovich uses these normalized results.".

So anything you wrote in >130 faktorovich: that rests on the assumption that I accuse you of using mainly absolute frequencies is false.

It's a very convenient omission. One that allows you to write a haughty, dismissive, imperious post that pretends you know what you're talking about and write as though others have not understood. It's one of your favourite kind of conceits. But if you have to resort to quote-mining to pull it off, you're probably wrong.

that would make this measure uniquely useless when applied to short texts

More reframing my post into something it did not say. "Varying lengths", I said. Such as the Mansfield, Emma, Northanger discussion.

I am now sure you do not understand what the terms "absolute frequency" and "relative frequency" mean, as what you are saying is nonsensical. All of the tests I apply that use frequency use relative frequency

I, again, refer you to >118 Petroglyph:, and to the top of this post where I point out the bits that you avoid quoting. The bits you avoid quoting make this assertion of yours wrong. The implication that I claim you use mainly absolute frequencies is also wrong. See the bits you avoided quoting.

You argue I "should" use per-100 sentences measures, as if you have not read that I indeed use this measure

In >118 Petroglyph:, I said "The website Analyzemywriting.com, one of Faktorovich's online services, presents the punctuation marks both in absolute frequencies, and terms of per 100 sentences: {image} In her tables, Faktorovich uses these normalized results."

If you are responding in detail to post >118 Petroglyph:, how is it possible that you have missed this?

Just so you will understand the difference between "absolute" and "relative", here is a definition from Investopedia

If you search page one of this thread for "absolute frequenc" and "relative frequenc" (to catch both "frequency" and "frequencies"), you'll find I have been arguing you should uses relative frequencies for your word counts. Relativized against your entire corpus. Not most frequent words/phrases in one particular text (absolute frequency). But most frequent in a text relative to a corpus. Did you only google those terms just now? If so, I can't say I'm surprised...

The fact you googled a finance and investment website for some utterly basic concepts in corpus linguistics is yet another indication of your complete out-of-your-depthness.

The website I use does not give the "option of normalizing punctuation per 1000 sentences"; it gives the option to create relative frequencies; refer back to the definition of "normalizing" I provided; the website in question refers to this as "per X" not "normalized" on a -1 to 1 scale

Corpus linguistics does not usually follow statistical standards and notational conventions from STEM fields, no. If you were familiar with the field you would know that. But normalizing "per 100,000" words, say, or per million words (as O'Keeffe et al do in that screenshot I posted in >118 Petroglyph:) is entirely unremarkable in corpus linguistics. Or normalizing the number of, say, commas per x number of sentences or per x number of words.

There are also other ways of normalizing things. There is no such thing as "one normalization to fit all." Expressing the number of commas in a text as a percentage or on a scale from 0.0 to 1.0 would be, well, useless? The proportion of all single characters? Comma's as 62% (or 0.62) -- of what? Total punctuation mark use? Expressing it in terms of 100 or 1000 sentences, though, that is a measure that tells you something about the average internal complexity of sentences. So that is a sensible normalization in this case.

But you've demonstrated numerous times during this thread that, once you form a "brisk impression" of how something works, you take that impression literally and use it as a black/white all/nothing right/wrong mental category. It becomes an easily-applied test with which to declare someone else's statements "nonsensical" and your own statements "correct", though only in the most superficial, literalist and unhelpful kind of way.

That's your way of doing things, though. Most other people don't think like that. Or act like that.

You know who really doesn't act or think like that? A scholar.

136faktorovich
mayo 6, 2022, 8:58 pm

>134 Keeline: Out of the initial entries I found one stands out as the likely source that popularized the term in later appearances, "The Ghost Writer and His Story" by Graves Clark (1919), and others in this thread have mentioned other non-glitchy appearances as early as the 1880s. Those are some interesting stories you have dug up while exploring the appearances of this term; so I am glad I inspired you to look further in this direction.

137faktorovich
mayo 6, 2022, 9:14 pm

>135 Petroglyph: I left it out because it was repetitive, but, "relativized frequencies (though not always!" is still false, as my frequencies are indeed "always" "relative".

I have not missed anything in your replies. Your comments on these topics, and your attempts at rebuttals have been deeply nonsensical, so there is no rational defense you can come up for them, and so you are relying on adding still more nonsense.

Looking up terms in a dictionary before discussing them in public is an essential part of being a professional linguist/ English-language-researcher. Anybody who does not check their terms before defending a position that requires a complete understanding of terms, is not a scholar who cares about the meaning they are conveying.

"Comma's as 62% (or 0.62) -- of what?" Commas can be counted as a percentage of all punctuation marks, but this is not the type of count that would be useful (as there are some rare punctuation marks or variants in punctuation that are likely to introduce glitches). My commas-per-100-sentences test has a range between 413 and 1075; this is not a percentage; it is a "relativized frequency" that is normalized during the binary conversion process into 1s/0s.

Whenever I detect nonsense in texts that are or pretend to be scholarship, I always call it by this term, as there is no better term to describe words that are used incorrectly and thus do not carry any sense. I have been reviewing books with this degree of honesty since 2009, so it is my instinctual position. It would be highly unusual for me to ever stop myself from pointing out a mistake of this sort, and especially when there are so much nonsense in a text that the whole thing is too nonsensical to allow anybody else to attempt to read it without a warning.

138spiphany
mayo 7, 2022, 4:42 am

>137 faktorovich:: "Looking up terms in a dictionary before discussing them in public is an essential part of being a professional linguist/ English-language-researcher. Anybody who does not check their terms before defending a position that requires a complete understanding of terms, is not a scholar who cares about the meaning they are conveying."

The point being made in the post that you responded to was not criticism of the fact that you looked up a term, but rather 1) that you needed to look up the term, which should already be familiar to someone doing corpus-based linguistic analysis and 2) that the source you chose to back up your findings is from a completely different discipline.

A dictionary is a good starting place when one encounters terminology with which one is not familiar.

However, there are many words that have specific technical meanings in particular contexts. Different disciplines may have their own particular ways of using some of these terms that differ from common usage or the usage in other disciplines. Often these words convey concepts that require extra-linguistic knowledge--scientific or mathematical models and theories, for example. A dictionary cannot and is not intended to provide this sort of knowledge. In such cases, the confused reader would be better advised to consult a textbook or encyclopedia on the subject matter in question.

***This service announcement brought to you by someone whose work regularly requires her to read and understand texts on subjects she knows little about.***

139faktorovich
mayo 7, 2022, 11:35 am

>138 spiphany: Even with a filmographic memory, it would be impossible for any human to remember the precise definition of every word in all dictionaries (up to a million words in usage). Thus, any critic that never looks up terms, but instead relies on their intuition or vague recollection of terms is likely to misuse them, as Petroglyph has done in his misapplication of terms like "normalization". I and everybody else in this discussion are likely to be "familiar" with terms like "normalization", but linguistic debates are not solved on vague familiarity, but rather with precise knowledge of the rules of grammar, definition of terms, etc. Anybody can discuss a subject based on their "familiarity", but experts must discuss subjects based on precise knowledge, which is always found by checking sources, including checking dictionaries.

Statistics is a field that covers a wide array of disciplines. I first learned statistics during my economics degree classwork, so Investopedia is a familiar source that consistently provides useful definitions with mathematical elements that help to visualize the options of a mathematical problem. The statistics and the statistical terms do not change in meaning when they are applied to investment calculations or to computational-linguistics. To "normalize" data still means the same thing, especially since the process of normalization is all about taking whatever type of data units the raw data is in, and altering it to a numeric system where different types of measurements can be compared to each other. There are no classes I have heard of purely designed for computational-linguistics statistics, so most who want to be trained in this subject have to step outside of linguistics and computing and into fields like economics that offer textbooks, dictionaries and classes on statistics as the primary subject.

Non-quantitative philosophical theories, such as "extra-linguistic knowledge", are irrelevant when a scientist is working to arrive at precise quantitative results by applying the most relevant statistical tools.

I am particularly interested in the subject of investment statistics at the moment because next week I am doing a special review project for PLJ where I am reviewing textbooks (that publishers sent to me for review) about investment, and a few of them are purely quantitative, or describe the advanced mathematical/ statistical tools investors can use to make investment decisions.

140Keeline
mayo 7, 2022, 1:53 pm

>139 faktorovich:

Even with a filmographic memory

I have to admit that this expression is new to me. I have certainly heard of the more colloquial "photographic memory" that shows up in popular media and "eidetic memory". In attempting to find "filmographic memory" online I am seeing some of the other terms coming up but my browser spell check is not familiar with this word. I don't say it isn't used by someone but it is not a common expression, nor does it seem to be a technical term with a specific meaning. For the other two I found the usual Wikipedia entry (detailed) and this page.

https://www.betterhelp.com/advice/memory/difference-between-eidetic-memory-and-p...

Looking up words is something every scholar might (or should) do when there is uncertainty of the definition. Whether they cite the definition found is another matter which depends on the purpose and audience of the communication. However, I agree that when looking up a word, a generic dictionary often falls short, especially when one is looking for a scientific or mathematical meaning. Likewise, if one is looking for a usage from a past decade or century, modern dictionaries may emphasize current usage which can be misleading. We can all think of words that have been transformed in the past 50-75 years to have entirely different meanings.

Take the word "syndicate" which was used in the late-19thC and early 20thC as a group of people preparing content (articles or stories or pictures) for newspapers and magazines. Thus the Stratemeyer Syndicate who was involved in producing books for publishers mainly in juvenile series fit at the time it was established in 1905. But, with the rise of Organized Crime in the Prohibition era, the word "syndicate" became a "crime syndicate" and that became the dominant definition.

One of the employees of the Stratemeyer Syndicate described an interview that the head, Harriet Stratemeyer Adams, had with some vaguely-identified men in suits who wanted to know the nature of the business conducted by the Stratemeyer Syndicate. She explained, with as little detail that she cared to, that they wrote and produced these children's series books. At the end of the interview she asked them if they were from the F.B.I. They seemed a little surprised by this question, which was apparently close to the truth. When asked why she thought so, she replied, "well, I do write mysteries." The prevailing thought is that these investigators thought that the combination of a German name and the term "syndicate" might be something darker than what it really was.

My reaction to citing an investment dictionary/encyclopedia for a term used in statistics for linguistic computing was skeptical as it has been expressed by others. I know that words like "average" are often ill-defined in popular use but this and related terms (mean, median, standard deviation, etc.) have very specific meanings in statistics. The two-part question raised when using a source like this is — is it the best source for an authoritative answer; are there differences in the way the term is used in investments than it is in statistical analysis?

James

141faktorovich
mayo 7, 2022, 8:57 pm

>140 Keeline: Indeed, I use the term "filmographic memory" because I mean something different from "photographic memory", as I recall things as if I am watching a film, though at times it can be more like looking at a photograph, or a series of photographs. It is indeed essential to look up such terms when they are even vaguely unfamiliar, so that we understand an author's intended meaning. Citing definitions has been essential in this thread because whenever I object without quoting a source such as a dictionary as the source for my objection; other writers in this thread simply say that my interpreted meaning is wrong; and the meaning can only proven to be correct by citing a dictionary meaning to avoid a debate about a basic term. Since I am currently translating a Dictionary-and-History via "Restitution", I am vividly aware of the shades of changes of meaning over time. But a term like "normalization" is not an ancient concept that is out of date, and so modern dictionaries, and especially those that specialize in statistics/ investment are suitable for this task. The problem is not that I used a dictionary to define these terms, but that Petroglyph misused these terms or used them in a nonsensical manner that is contrary to all definitions of them across all known dictionaries (specialized and non-specialized). Arguing that the problem is my source-checking or diligence with finding dictionary meanings is similar to arguing that the problem is in whoever outed the abortion opinion, and not the fact that turning abortion into an illegal act is problematic.

142Keeline
Editado: mayo 7, 2022, 11:57 pm

I see a relatively small number of pages thad use "filmographic memory". It is clearly not a common term used by others and it was not defined by you when you used it. Is this the definition you mean as covered on these pages?

https://www.google.com/search?q=%22filmographic+memory%22

James

143prosfilaes
mayo 7, 2022, 10:19 pm

>139 faktorovich: Non-quantitative philosophical theories, such as "extra-linguistic knowledge", are irrelevant when a scientist is working to arrive at precise quantitative results by applying the most relevant statistical tools.

For one, you have told us that "Holy Ghost" refers to the fact that they were ghostwriters, which is extra-linguistic knowledge. For another, we can go back to near the start of last thread for the discussion about the most relevant statistical tools where yours were referred to as "amateur hour". There's a lot of choices of statistical tools, but random websites with hidden code is one of the least trustworthy.

144Petroglyph
mayo 8, 2022, 2:06 am

>139 faktorovich: Re: dictionary definitions of "normalization".

You can't have looked very hard or very carefully. Possibly just enough to make yourself feel in the right and vindicated and whatnot.

From Collins Cobuild English Dictionary:

Normalize
in British English

{...}
2. to bring into conformity with a standard

{...}

in American English

1. to make normal; specif.,
{...}
b. to bring into conformity with a standard, pattern, model, etc.

Derived forms: normalization: noun.

From Merriam-Webster:

Normalize
2. mathematics: to make (something) normal (as by a transformation of variables)

The page at the link behind normal has this as the relevant definition:

3 a: approximating the statistical average or norm

Or even #6:

6: of a subgroup: having the property that every coset produced by operating on the left by a given element is equal to the coset produced by operating on the right by the same element

Therefore, as per Merriam-Webster, dividing the number of commas in both Mansfield and in Northanger by (# sentences / 100) entails normalizing them to per 100 sentences.

Quod Erat Dictionariandum?

145Petroglyph
Editado: mayo 8, 2022, 11:21 am

For this Lunch Break Experiment (tm) and Petroglyph's After-Dinner Stylomety corner, I wanted to take a look at lexical density:

As analyzemywriting.com defines it: "lexical density is simply the sum of the percentages of nouns, adjectives, verbs, and adverbs as stated in the definition of lexical density." In other words: count the proportion of all the content words as a percentage of the entire text, and don't count the function words (such as auxiliaries, determiners, pronouns, prepositions). It is a measure of how many content-words there are in a particular text.

Faktorovich's "newly-invented" "27-tests method" spends 8 of its tests on lexical density, broadly speaking: there's one for the lexical density of the entire text, and then a one test each for nouns, adjectives, verbs, adverbs, prepositions, pronouns, and auxiliaries.

Let's see if she can get reliable results for these tests.

1. What about genre?

Faktorovich happily compares genres (sermons, letters, poetry, plays, prose, philosophical works, etc, etc.). Plays like Hamlet (~32,000 words, ~3900 sentences) are blithely compared to individual poems like Thomas Edwards L'Envoy to Narcissus (511 words; 15 sentences.)

It is well-known that lexical density is a feature that changes with genre. Plays, which are virtually entirely dialogue, are likeley to contain a much higher number of pronouns than, say, philosophical works. Even analyzemywriting.com knows this! They even performed this brief and informal comparison between some random wikipedia entries and some short stories. Their main finding is that "Lexical density, as measured by this website, tends to be higher in informative writing than in fiction on the order of about 7%."

This is, in fact, one of the things you can use lexical density for: differentiating between genres, and between different kinds of texts, and between different styles. If the same person writes a play, and also a theological tractate or something, the fact that those texts will have different lexical densities does not justify the conclusion "must be two different people who wrote this". It's just the concept of "genre". Individual differences in lexical density may play a role, too, obviously. But that's something you have to consider in conjunction with things like register, age, genre, level of education, familiarity with genre, creativity, etc. "Different density means different people" is not a conclusion you can jump to, or, as Faktorovich does, merely assume it's true.

Or, as McArthur et al. (2018) note: "Different registers will often differ in terms of lexical density; generally academic texts have a greater lexical density to informal spoken texts" (q.v. Lexical Density).

Let's look at what Faktorovich makes of this concept. From Re-attribution p. 96 (see reference in >17 Petroglyph:):

The 6th test for Lexical Density had one of the most pronounced clusters. This is a measure of an author’s tendency to be verbose or to compress their thoughts into challenging language (indicating greater lexical density), or to write lightly and with brief strokes (registering as low lexical density). (p. 96)

So no. Lexical density is a property of texts: chatty, spoken language tends to include more sentences like "I gave it to her" (where the lexical words have been underlined, for a lexical density of 1/5 = 20%); and a more technical, written text might put that as "The person was given the book" (for a lexical density of 3/6 = 50%).

Lexical density is not a feature of a person. It is a feature of a text. This is a very important point, and Faktorovich has mistaken the latter for the former.

She continues:

While one might assume that Verstegan’s sermons and speeches were particularly dense linguistically to bombastically express propagandistic positions, they actually predominantly cluster on the lower end of this scale. 9 of Verstegan’s texts took the 9 lowest positions on this range. The lowest among these was Verstegan’s translation of King James Bible that has a lexical density of only 43.69, probably because it was intended to be simple enough to be grasped even by the semi-literate masses. “Elizabeth I’s” speeches follow, then the Marprelate Tracts, and then other rhetorical texts. This pattern is only interrupted by the similarly lexically light Byrd-ghostwritten Addition III of Sir Thomas More. Then, 5 more of Verstegan’s texts follow, including “James I’s” Demonology. The upper-end of the lexical density spectrum is dominated by Harvey and Sylvester. Harvey’s self-attributed Pierces Supererogation has the highest density among the 284 texts at 64.02; Harvey appears to have chosen this piece to represent his own style in 1593 to defend his scholarly credentials around two years after his fellowship at Cambridge ended, and while he might have still been hoping to rejoin academia." (p. 96)

So. Faktorovich assumes that sermons and speeches (i.e. intended to be delivered orally in a largely illiterate society) would have high lexical densities, which is one thing she got arse-backwards; and she associates different lexical densities with different authors (with most of "Verstegan's" texts having low densities). This, too, is arse-backwards.

In other words: Faktorovich has misunderstood what lexical density is, what it measures, the kind of thing it is used to measure, and what kind of genres can be expected to have low lexical densities.

But that's par for the course.

This, in and of itself, makes whatever she makes of the results of her tests unreliable.

What about Early Modern spelling?

Under the section "Assumptions and Limitations", analyzemywriting.com says that "We first note that our calculation of lexical density assumes that a text is written in English. Furthermore, it is assumed that a text is properly punctuated and apostrophes are used correctly."

If this free, online service is calibrated for present-day spelling conventions, that means that it stands to reason that Early Modern spelling conventions, or the lack thereof, may have an impact on the results that this black box spits out. Enough of an impact to validate the results, perhaps?

Let's send up a few trial balloons and see if renaissance spellings can influence lexical density counts.

To begin with, I tried this query:

loved lovèd lov'd

and got this as the result:

So: loved is analyzed as a verb (correctly!); lovèd is analyzed as a noun (incorrectly!); and lov'd is analyzed as an auxiliary (incorrectly! -- Analogous to I'd, you'd, she'd, ...). Yikes! A verb being re-assigned to a noun (merely because of un-modern spelling!) impacts both the percentages for nouns and for verbs, but it does not impact the calculations for lexical density. But a mis-tagging of a lexical verb as an auxiliary does.

So yeah. That little test right there is reason to doubt the results that analyzemywriting.com produces for Early Modern spellings.

What about thou thee thy thine? These four words, pronouns all, should yield a lexical density of 0%.

Only two of these are parsed correctly as pronouns. thy and thine are tagged wrongly as "adjectives". Having been analyzed as lexical words instead of function words, they drive up the lexical density.

Let's continue.

I took this sermon (written in 1604; printed in 1623), cleaned out the page numbers and some of the formatting artefacts. I took one paragraph, the one on pp. 190-194 starting with "in conclusion", and re-spelled it into modern English. (See here for a .txt file). I then fed both versions into analyzemywriting.com, just to see if there were differences.

Oh yes, there were. There were problematic problems, even.

First, the overall figures for lexical density of these paragraphs were different. The Renaissance original got a score of 51.3; the
Re-spelled version got 46.24. That's 5% difference.

Where do those differences come from?

Well, turns out that there were things I thought could be problems when I was respelling the text, and that actually are: to teach vs, that wee must not "vs" is analyzed as a noun, and not as the pronoun "us"; "wee" is analyzed as an adjective (as in "a wee lass") and not as the pronoun "we"; though it bee neuer so small "bee" is analyzed as the noun and not the copula "be".

(I must add that there were things I thought could be problems but that turned out not to be: heauen, vnderstand, keepe, needefull, feare: these things are indeed analyzed as lexical items. The plural pronoun ye also is correctly identified as a pronoun. Though I did not check whether they're analyzed as the correct subtype of lexical word).

Here are two screenshots. On the left is a set of 10 sentences from the re-spelled version; on the right is the original. The words that analyzemywriting.com judges to be lexical (i.e. noun, verb, adjective or adverb) are in green; the percentage of lexical density for each sentence is to the right. I've underlined in orange a few differences. (Right-click to embiggen)

The lexical density of nearly every sentence is different across the two spellings.

The culprits are the un-modern spellings of doe "do", haue "have" and bee "be" (and also shal "shall"). Now, these are all a) high-frequency words, and each occurrence adds up over time to (likely) significant differences in the counts; and b) they should be counted as function words, but instead they are coloured green and, therefore, drive up the lexical density. Sometimes by as much as 15-20%! (as is the case in sentences 22 and 20)

Here are two more screenshots, this time showing what proportion of a text is taken up by each part of speech that analyzemywriting.com recognizes. The first image shows the percentages for the re-spelled version, the second image for the Early Modern paragaph.

Every single one of these is different. Nouns differ by as much as 8%; and the count for auxiliaries doubles. And they're the same text! These texts contain the exact same number of nouns, and auxiliaries, and pronouns, etc.

If Faktorovich were to import both versions of this text into her corpus, the only chance of these two identical texts ever being marked similar in her kooky and wrongheaded red-font-colour way of testing is if they happened to be adjacent to each other when sorted low-to-high in every single of the 8 columns. Adjacent in the sense of ~9% of the corpus size. Despite them being the exact same text -- just in different spellings.

That's a big ask. I don't think I can trust that to be the case.

Conclusion: This free online service is explicitly designed for present-day spelling conventions, as the makers admit themselves. Applying the software to Early Modern spellings yields unreliable results.

But it gets worse

So. Does Faktorovich use modern spelling, original spelling, or both (which would be the worst of both worlds)?

She uses both: "90 of the tested versions of the texts had modernized spelling, while the other 194 had their original or only slightly edited Early Modern English spelling" (Re-attribution, p. 29)

Insert eye-rolling emoji here. Early Modern spelling, modern spelling and an inbetweeny version. Immediately after this sentence she mentions that versions of Shakespeare with modernized spellings and punctuations were registering as different from Early Modern punctuations, so she switched to original spellings for these texts. But not, apparently, the 90 other modernized texts!

In characteristic Faktorovich fashion, she ignores the issue and goes on to use tests that she herself has found to be unreliable: "While it is best to test texts with the same level of modernization; many of the groups {of texts she attributes to the same ghost-writer} included both old and new-spelling texts, so modernization in many cases leaves enough of a signature for a text to be identifiable" (Re-attribution p. 30).

In other words: IT wORKs sOME of THE tIme So Let'S aSSume IT wORKs alL Of The TIme. I also note that she omits a very important part here: did the similarities between the original and the modern spelling come from the lexical densities tests? Or were other tests responsible for this similarity? That would, in fact, be important to know in deciding whether or not wE CaN aSSume IT wORKs alL Of The TIme.

Conclusion: What does all this mean for her tests?

Well, 8 of her tests concern lexical density and the percentages of various parts of speech. As I have shown here, and as the makers of the site pretty much explicitly note: analyzemywriting.com is not the appropriate tool to use for versions of English that deviate too much from standardized spelling conventions.

I should note here that 5 of her tests are for punctuation marks -- which are often altered by printers, later editors and by the digitization team. This is a problem she seems to be aware of (and it's been brought up numerous times in this thread by others). So, properly speaking, without a systematic treatment of punctuation (either all-original, or all-modern), these five tests should be included here, as well, as "tests that are unreliable because the tests measure a spelling-dependent feature".

But that is a post for another day.

The final facepalm (for now)

When two versions of the same “Funeral Elegy by W. S.” were compared in old (4,450) and new (4,459) spelling varieties, they matched each other on 16 out of 27 tests. Since two nearly-identical texts should match each other on all 27 tests, it follows that texts matching each other on 10 or more tests despite being from separate old/new categories, can be assumed to actually have far greater similarity if they had been compared without this spelling divergences. In fact, a 10-test match is sufficient in this corpus to establish a match because digitization-generated spelling irregularities and various other glitches have introduced linguistic inconsistencies into most of these 284 texts." (Re-attribution p. 30)

Near-identical texts only match as just under 60%, as measured by her method. From this, Faktorovich does not conclude that she must be doing something wrong. Instead: YAY i Can DEpLoY EveN loWEr sTAnDarDs.

References
McArthur, Tom, Jacqueline Lam-McArthur, and Lise Fontaine, eds. 2018. The Oxford Companion to English Language. 2nd Edition. Oxford University Press.

146faktorovich
mayo 8, 2022, 12:13 pm

>142 Keeline: I have never seen the term "filmographic memory" used by anybody other than myself before. I started using it at least a decade ago to describing remembering events as recorded films, in addition to recalling pages from books photographically. For example, I was a town hall meeting where oil executives surprised us by announcing they were going to drill new holes in Quanah. I asked several questions during and after the meeting. I did not have a pen and paper or a laptop in my hands, so it did not look like I was recording the conversation, so they were uniquely honest. But then when I got home, I remembered every word that was said by all sides across this 2-hour or so interaction, transcribed every detail and published an article about it. Here is the resulting article: https://www.prforpeople.com/news/%E2%80%9Cit-ain%E2%80%99t-no-california%E2%80%9...

147faktorovich
mayo 8, 2022, 12:26 pm

>143 prosfilaes: You are misinterpreting what I am saying by imagining what I said instead of quoting what I actually said again; go back and read the full post 414. Every appearance of the term "Holy Ghost" from the Renaissance absolutely does not indicate a confession they were ghostwriters. The specific instances I cited subvert this term to refer to the authorship of the theological text by the "Holy Ghost", and do not merely describe the "Holy Ghost" without putting the credit for authorship into this "Ghost's" hand. The credit line that a text was written by "Nashe's Ghost" is not at all convoluted, but is a direct satire about "writers" who do not really do their own writing (perhaps because they are dead, or perhaps because they are illiterate) and instead have a "Ghost" (i.e. ghostwriter) write their book(s) for them. If you think there is something metaphysical about this, you have missed the joke. The joke is that a Ghost cannot write, and that a professional writer has to step in to write the text that is credited to a posthumous byline of an author such as "Philip Sidney" or "Marlowe", who did not have any book credits to their name during their lifetimes, and then seemingly start writing for the first time post-death as "Ghosts". No, the Workshop was not saying the ghosts were actually writing. It is amazing that you guys keep repeating this as if you do not understand it. Can you clarify what you are confused by? It is not "extra-linguistic", but basic linguistics to grasp that the Workshop was repeatedly confessing to their own authorship by joking about the possibility of Ghosts (instead of them) doing the work. There are several statements regarding ghost-authorship made in the front-matter poems to "Fletcher" and "Beaumont's" 1647 Folio that was one of the last Workshop projects released while the last one of them (Percy) was still alive. There is a chapter in Volumes 1-2 about these references in the Folio that explores their confessional implications. The statistical tools I use are the exact right tools to establish an attribution conclusion.

148Keeline
mayo 8, 2022, 12:28 pm

>145 Petroglyph:

To begin with, I tried this query:
loved lovèd lov'd

I have a few questions and comments about the baseline tests described.

Is lovèd with the grave accent what is normally found in vintage printed and/or modern texts under consideration?

I've seen the first and last but don't recall the middle one.

Were these three words the only ones in your corpus?

I ask this because there are a couple ways for software to identify the parts of speech (POS). One of them is a word list but as we know, a given word can have different meaning based on the sentence structure and other context hints (e.g. "made of lead" as a noun vs. "lead a group" as a verb).

From the descriptions, https://AnalyzeMyWriting.com, is a "black box" experiment where you input a text, run some tests, and get some report. But we only know what they say the software does inside.

Since it is a free service, they probably have not spent hundreds of hours to develop new and innovative code to analyze a text and come up with new measurements that have never been done before. Instead, it is more likely that they have collected open-source tools (not identified?) and made a friendly web user interface to implement them. The terms of service rules are designed to prevent network and processor abuse at the Denial of Service (DoS) level.

Supposing they did use some open-source textual analysis tools, the source code for them is available and they are documented. But not knowing which ones means that we have to play a guessing game like the "black box" experiment of a physics class.

I agree with the concern expressed:

If this free, online service is calibrated for present-day spelling conventions, that means that it stands to reason that Early Modern spelling conventions, or the lack thereof, may have an impact on the results that this black box spits out. Enough of an impact to validate the results, perhaps?

The word lists, POS identifications, and sentence structure of the tools used are almost certainly optimized for modern English writing. Other languages and other time periods of English are likely to produce weaker results. It will recognize some words and structures but fail to identify others. This noise in the results could be the difference in identification, especially if the later analysis method tends to further mask details.

As an illustration, a colleague of mine applied some reading scores to certain series books that had been digitized. The reading scores were provided by Microsoft Word, I believe, and were named. They gave grade levels based on the documented methodology of each test. So even though Microsoft Word is notoriously the opposite of an open-source application, the tests most likely follow the reading store methodology. As it turned out, these reading stores had some interest but were not a means of distinguishing one ghostwriter from another. There were other factors involved, including the outlining and editorial influences of the Stratemeyer Syndicate and the publisher editors. To complicate things, these people and standards changed over time. The books published today, well after the Syndicate ceased to be a separate entity, are directed to much younger readers than were the originals.

My assessment of this situation is that you can use available tests but not all of them will be relevant. Let's say you have a chemistry set and you are going "off book" in your experiments and combining chemicals as they appeal to you. You could use up your supply of red and blue litmus paper on the results and not learn anything if the reaction (if any) did not result in a change of the pH (base-acid scale) of what is left in the test tube.

Over the years I have seen many individual and suites of textual analysis tools that can be deployed on a server. They can be used at the command line or in conjunction with software to produce dynamic web pages such as Perl, PHP, or Python. Here is a write up on one such tool

https://aicbt.com/authorship-attribution/

Remember that the main motivation for such software is to identify authors of modern texts. This could be for an ethical concern like plagiarism or for some criminal investigation such as a terrorist manifesto. It can also apply to anonymous or pseudonymous text questions such as the insider book Primary Colors or the cases where a famous author tries to secretly release a different kind of book than for which they are famous under a pseudonym. There is more of a financial motivation (along with other sources of motivation) to build tools for this kind of analysis than there is to answer authorship questions of the distant past.

The emphasis on punctuation, which I argue is more relevant in modern texts where the author's punctuation is typed on paper or into a computer and likely to be left standing with less intervention than 400-year-old texts which were typeset from manuscripts (perhaps with little or no punctuation at all) and edited many times, is indicative of the tools above and possibly inspired by analysis like the one described here which has been rebroadcast in many blogs.

https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4

In this, the words are removed, leaving only a handful of punctuation marks (; : ' " , ! ? .) to be counted or displayed as walls of punctuation to give a sense of the patterns. It is interesting and relevant only so far as the number of hands manipulating the text does not become so large that we can't tell who contributed what from it. This page (a chapter of a larger work) may be of interest:

https://sites.ualberta.ca/~sreimer/ms-course/course/punc.htm

which contains this paragraph which relates to this discussion (emphasis added):

Generally, manuscripts tend to be more lightly and less consistently pointed than printed books (and with the exception of the punctus, virgule, and the blank space, almost all of our modern marks of punctuation have come into use only since the thirteenth century). Modern punctuation, designed to clarify syntactic structures rather than to indicate breathings, is largely a Renaissance invention, developing during the first generations of the printing press, and codified in the eighteenth century (about the same time that capitalization and spelling became fixed in more or less their current form). Among the earliest works showing "modern" punctuation is Francis Bacon's Essays. An interesting early discussion of the nature of modern punctuation can be found in Ben Jonson's English Grammar (composed ca. 1617, printed posthumously in 1640). Eighteenth- and nineteenth-century punctuation practice varies considerably, but tends to be "heavy"; current "light" punctuation is largely the invention of H. G. and F. G. Fowler, The King's English.

There are many textual analysis tools out there and some have more interest than others. Here's one page with links to projects:

https://digitalresearchtools.pbworks.com/w/page/17801708/Text%20Analysis%20Tools

This one is also of interest:

https://corpus-analysis.com/

Note that most of these make individual measurement, counting one type of thing in an absolute form (how many X in the text) or relative form (how many X per Y words/sentences).

James

149Keeline
mayo 8, 2022, 12:35 pm

>146 faktorovich:

I have never seen the term "filmographic memory" used by anybody other than myself before.

If you believe this, then that is the time to define it rather than make us guess or think you have confused it with the more conventional terms I mentioned.

James

150faktorovich
mayo 8, 2022, 12:38 pm

>144 Petroglyph: Your argument is that I should not use financial dictionaries to establish the meaning of a mathematical/ statistic term, and instead you are using general dictionaries like Collins and Webster? This is clearly a case of you arguing for the sake of arguing without any weight or significance to the counter-argument that you are making that is lacking in any rational meaning.

"Normalization" is "to bring into conformity with a standard"? This is a non-mathematic philosophical definition that is irrelevant when one is discussing computational-linguistic methods for calculating authorship attribution in the most non-biased and the most accurate manner possible. And if you apply this definition, "approximating the statistical average or norm", you are going to change all of your data from the actual data into averages, which would be the maximum possible level of none-sense, as your attributions would all point to a single average author. As for the last attempt at the definition, it is absolutely irrational to divide all 27 tests I used by 100 sentences. This division makes sense for tests like commas (because there needs to be some standard "X out of Y" calculation to account for a text being a wide range of possible words in total), but it does not make sense for characters-per-word (because dividing this measure by 100 sentences only does a second breakdown when you have already calculated the average characters-per-word for the whole text). This is why what you are nonsensically insisting is wrong with my method, are the exact right things that lead my method to return accurate attribution results.

151anglemark
Editado: mayo 8, 2022, 1:49 pm

>145 Petroglyph: Interesting – thanks!

Just checking, though: you say

to teach vs, that wee must not "vs" is analyzed as a noun, and not as the pronoun "us"; "wee" is analyzed as an adjective (as in "a wee lass") and not as the pronoun "we"; though it bee neuer so small "bee" is analyzed as the noun and not the copula "be".

– do you deduce their analysis from the POS percentages? Because I find no list of tokens classified as nouns, tokens classified as adjectives, tokens classified as pronouns, etc, just the pretty coloured bars and percentages. And that's more than a little problematic if it's going to be used as the basis of an analysis of the proportions of each part of speech!

I carried out a slightly similar exercise: I ran four versions of the "Once more unto the breach" speech from Henry V through the lexical density* analyser at analyzemywriting.com.
1. A First Folio text found at https://internetshakespeare.uvic.ca/doc/H5_F1/complete/index.html
2. Gutenberg.org's version at https://www.gutenberg.org/cache/epub/100/pg100.txt
3. A version of uncertain provenance at https://poets.org/poem/henry-v-act-iii-scene-i-once-more-unto-breach-dear-friend...
4. Another First Folio version, with modernised (American English!) spelling at https://internetshakespeare.uvic.ca/doc/H5_FM/complete/index.html

All four texts had the same number of words (273), but they differed quite a lot in other respects, with 34.43% nouns, 4.76% adjectives, and 4.76% auxiliary verbs in Text 1, and 29.3% nouns, 6.23% adjectives, and 2.93% aux in Text 4. For instance. Texts 2 and 3 were the most similar, both of them having 28.98% nouns, 6.23% adjectives, 6.23% adverbs, 15.38% prepositions, and 5.49% pronouns. The difference between those two versions lies in the verb / auxiliary proportion, specifically hard-favour'd in Text 2 which is hard-favoured in Text 3. (Both texts have several other identical 'd verbs, call'd, swill'd, etc).

But this points to another problem, a big one regardless of the age of the text – analyzemytext doesn't tokenize words! That is, "you'd" is counted as one single function word with one single part of speech, even though it is actually two parts of speech ("you" + "'d", pronoun + auxiliary). To illustrate this, I ran a lexical density analysis on the string
"you would you'd call'd". This was parsed as four words with the following POS distribution:
Nouns -25%**
Adjectives 0%
Verbs 25%
Adverbs 0%
Prepositions 0%
Pronouns 25%
Auxiliary Verbs 50%

Checking the words one by one, I verified that "you" was classed as a pronoun, "would" as a main verb, and you'd and call'd as auxiliary verbs.

So, yeah. No. I was well aware of the fact that analyzemytext.com was not built for research, but this is a pretty big flaw which makes the word count and lexical density analysis unreliable even for a modern-day text.

I was going to say something about the CLAWS POS tagger, too, but I think I've been verbose enough here.

-Linnéa

* or, as I have occasionally called it to the amusement of my students, lexical dentistry
** no idea what that's all about

(edited to fix own brainfart – I do know what an auxiliary verb is.)

152anglemark
mayo 8, 2022, 1:32 pm

>148 Keeline:
>145 Petroglyph:
You might be interested in the following paper:
Rayson, P., Archer, D. E., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of the Corpus Linguistics conference: CL2007. Available here: https://eprints.lancs.ac.uk/id/eprint/13011/

It's a small pilot study, but it highlights some of the issues involved here. The TL;DR version is that with a tagger (CLAWS) that is consistently 96%-97% reliable when POS tagging Modern English texts, the reliability for the Shakespeare First Folio was about 82%, and for another EModEng corpus with texts related to law and the sciences it was 88.5%. Their spelling variant detection tool VARD improved the reliability a bit, though. The VARD tool is available at https://www.clarin.ac.uk/vard, and the CLAWs tagger (which is rather more useful than analyzemywriting to a researcher) is at https://www.clarin.ac.uk/claws.

-Linnéa

153faktorovich
mayo 8, 2022, 1:59 pm

>145 Petroglyph: "It is well-known that lexical density is a feature that changes with genre. Plays, which are virtually entirely dialogue, are likely (I had to correct this typo from "likeley") to contain a much higher number of pronouns than, say, philosophical works." I already tested the "lexical density" measure output on 284 texts just in the Renaissance corpus, so there is no need to go beyond it to answer the question of how genre impacts it. On the lower-end of this spectrum are mostly Verstegan's texts, including the KJ Bible, but there are also speeches, theological satire pamphlets (Marprelate), and Vere's letters. Byrd's section of poetry-drama, "Addition III" out of "Shakespeare's" "Sir Thomas More" is also on this low end of the spectrum. And Verstegan's rare play "Wilkins'" "Pericles Prince of Tyre" is also on this lower end of the lowest 19 rates out of 284. On the other top extreme of this range are mostly Sylvester and Harvey, with the top spots going to Harvey's poetry, such as "Rowlands'" "Betraying of Christ". There are also plays on this top end such as Sylvester's "Lodge's" "Wounds of Civil War". And a bit less high but still on the higher end is Verstegan's non-fiction rhetorical text, "Meres'" "Wits Common Wealth, Part 2", which ranks 240 out of 284 on this "lexical density" test. Intuitively one might have guessed that rhetorical non-fiction would have the higher "lexical density" scores, but the numbers indicate the opposite to be the case. There are different genres both on the high end and on the low end. There is more of a pattern with Verstegan and Jonson being on the lower end and Harvey and Sylvester being on the higher end of this spectrum than between specific genres being at the top or bottom.

"Lexical density, as measured by this website, tends to be higher in informative writing than in fiction on the order of about 7%." This points to a major problem of assuming one knows what the results are going to be before starting an attribution experiment. While the specific texts they chose to test between non-fiction websites, and fiction might have pointed to a higher lexical density in non-fiction; there are many writers that write extremely linguistically complex fiction, and many who write very light or linguistically simply non-fiction. An attribution method is designed to spot linguistic differences between texts without giving any weight to the genre. This is why my combination of tests works, as it (as I have explained) shows linguistic patterns that override genre-lines.

A single test indeed does not lead to a precise attribution. So I do not claim anywhere in the study that merely, "Different density means different people". By putting this statement in quotes you suggest that I wrote it, when in fact this statement does not appear anywhere in BRRAM. Because any single test can produce an attribution glitch for an unpredictable reason, I use a combination of 27 different tests. In contrast, the standard accept Stylo method uses a single test only for word-frequency, and this indeed creates glitches, and assumes erroneously that: "Different frequency of word-usage means different people".

The test measures: Lexical words/ All words. Lexical words: "nouns, adjectives, verbs, and adverbs." Non-lexical words: articles, prepositions, conjunctions, etc. Here is what you state:

---"I *gave* it to her" (where the lexical words have been underlined, for a lexical density of 1/5 = 20%); and a more technical, written text might put that as "The *person* was *given* the *book*" (for a lexical density of 3/6 = 50%).---

I do not believe Analyze is calculating these measures correctly, and if I had designed this test, I would have done it differently. One problem is that "pronouns" can be used instead of "nouns", so they should be counted as "lexical words". The definitions for "lexical"/"non" do not specify that pronouns are "non-lexical", but this is indeed how the system calculates them. And "was" is technically a verb, so even if it is a "linking verb", it should not be counted as a non-verb if the definition states all verbs are "lexical". While I would not have designed these parameters for this test, whatever the rules are, when they internally consistent (and relevant to author-attribution) and they are applied consistently to a set of texts generate results that distinguish between linguistically different signatures.

My statement in the book about this measure holds as true. "This is a measure of an author’s tendency to be verbose or to compress their thoughts into challenging language (indicating greater lexical density), or to write lightly and with brief strokes (registering as low lexical density)." The second example has more "verbose" or more complex words. Nouns and adjectives are more verbose than articles or other "non-lexical" words; a sentence with 10 words with 50% of them as adjectives and 30% as nouns is more compressed in meaning, than one that has 80% of prepositions. An article is technically a "brief stroke" as it is shorter than longer nouns, verbs etc. (especially if the shorter pronouns and "linking verbs" are not counted as lexical).

I address the question of Modern vs. Early Modern spelling in an extensive section in the "A New Computational-Linguistics Authorial-Attribution Method Described and Applied to the British Renaissance" chapter of Volumes 1-2. And yet now that it would be relevant, Petroglyph is not quoting what I explained about spelling differences, and is instead paraphrasing what I said why not crediting me for having said it, suggesting I ignored it and he is saying these points himself. In my explanation of how I prepared texts for testing, I explain that I altered all unreadable for programs characters such as the è in "lovèd" into an e automatically, by applying the auto replace feature across each manuscript. So this typo in the original transcribed text, would not have negatively impacted my results. I also explain that there are indeed major changes between the Old and Modernized spelling texts, such as those that include 'lov'd" vs. those that do not have such contractions. This is why previous computational-linguists have not been able to distinguish "Shakespeare"-bylined texts into the two distinct Percy and Jonson signatures, as they have been texting Modernized versions of these texts that are very different from the original versions as editors have made an enormous impact on their linguistics. To fix this problem, I re-tested all of the "Shakespeare" texts in their original-spelling versions, and this made the test sensitive enough for the authorial signatures to be distinguished. The presence of a similar percentage of unreadable words such as "lov'd" in all texts for each of the signatures, just means these have a similar white-noise impact on the results, as they do not change the attribution outcome, but present some haziness. This haziness is one of the reasons the Renaissance texts have more glitches or mismatches or smaller percentage matches between texts than texts in the 18th-20th centuries that do not have some glitches. Despite this problem, it is far less likely glitches will impact the decision when there are also tests for punctuation and other non-word-frequency related items; in contrast, most other computational-linguists only test word-frequency, so such glitches occur at the same frequency for them, but they have a far greater impact on skewing the attribution decision towards an error.

I explain the special case of "Funeral Elegy" across Volumes 1-2, and not only in this passage. The other observations explain that it has been re-attributed to nearly a different other bylines by different linguists across the history of its scholarship. The main reason it only matched on 16 tests to the other version of itself is because one was modernized spelling and the other was original-spelling. As I explain across the series, Modern-spelling versions of "Shakespeare" etc. do not properly match the Old-spelling versions because the editors introduce their own style. By testing both of these versions for many different texts, I proved the editor-bias or editor-skewing to be a fact, and especially in the case of this "Funeral Elegy". Linguists who have been comparing the modern or old spelling versions might have been coming up with different attributions simply because they were only considering one spelling style. And I have seen almost no computational-linguistic studies that disclose if they are testing original-spelling or modernized-spelling versions of texts, or review how impactful this difference can be (as I have done across BRRAM).

154anglemark
mayo 8, 2022, 2:02 pm

Huh. There seems to be something pretty odd about how analyzemywriting handles verbs. Could someone else do a lexical density analysis at analyzemywriting.com of the sentences "You did it. You will do it. I said that you would do it." and tell me what your result is, and if it makes sense?

155faktorovich
mayo 8, 2022, 2:13 pm

>148 Keeline: It is irrational to reject a tool that accurately performs a basic test in a manner that is stated in its manual, simply because it is free and you are assuming this means the developers have not invested time to polish the provided tool. In practice, Analyze's tools have worked to distinguish between linguistic styles in the corpuses I have applied them to. Based on Petroglyph's test, the system only makes mistakes when it is applied to corrupted letters such as an e with a stress-mark above it. Otherwise, it consistently groups verbs etc. in the right categories. It is unnecessary for it to recognize Early Modern English spelling variants. And there are no programs I have come across that are capable of sifting linguistic statistics for the many variants that come up in these Renaissance texts.

In my experience, EEBO and other digitizing services preserve most of the original punctuation to make punctuation tests very accurate even for Renaissance texts. They introduce some glitch punctuation such as strange large dots in the middle of lines (mostly when a letter cannot be read by the system, as when there is a line above a letter to indicate it is a contraction, with an n/m that is missing but should be after this letter); these dots can be and were removed automatically by me during the text-cleaning process before testing. Pre-Renaissance punctuation oddities are not relevant to my tests of Renaissance texts that use generally similar punctuation rules to those we currently use.

156faktorovich
mayo 8, 2022, 2:18 pm

>149 Keeline: I assumed you all know what "filmographic" and "memory" mean separately, or that you could have looked up these terms, and then derived what the two of them when used together would have meant. It would indeed be amazing if Petroglyph defined each one of his terms every time he used them to make sure we all knew exactly what he meant by "normalized" etc. He cannot even give a single coherent definition that makes this term make sense when asked about it, and instead gives a rainbow of possible meanings to add nonsense on top of nonsense, instead of acknowledging he is using nonsense terms to make it sound like he has some cryptic expert knowledge of terminology that proves my method wrong; and nobody can understand how my method is wrong without understanding his special made-up meanings for these terms. If you misunderstand what I mean by "filmographic memory", you are simply confused about the type of memory I have; whereas, if you believe Petroglyph has reached some deep understanding that proves my method wrong in the miss-used terms he is applying to it; then, you are being led to an erroneous or false conclusion about a scholar's research (and not about Petroglyph and his personality or memory).

157Keeline
mayo 8, 2022, 2:43 pm

>155 faktorovich:

It is irrational to reject a tool that accurately performs a basic test in a manner that is stated in its manual, simply because it is free and you are assuming this means the developers have not invested time to polish the provided tool.

You seem to have missed my point.

My objection to this site as an analysis tool is that without knowledge of how it identifies parts of speech or the basis of other counts, it is an unknown "black box" experiment. We know only what they say about the tests.

But since they are unlikely to have developed the site from scratch, devising every string manipulation to perform counts, they are far more likely to take existing tools and put a web page wrapper on it and package that as a tool for people to use, payment or not.

I think your use of "irrational," "nonsensical" and similar terms is a bit overused, especially when you seem to miss key points such as what I was trying to convey in my post. Perhaps it was my responsibility for not wording things clearly enough or making the post too long. I will accept that. Will you do the same?

But my comments were mainly addressed at this particular experiment and the tool being used. If three variants of the same word are used and the software does not have two of them in its word lists, its ability to identify the correct parts of speech will be hindered because it cannot do so from the sentence structure (it's not that smart).

Using the right tool for the job is essential. If something is set up to work with modern texts, it may have some output for older ones but that does not mean that it is relevant and reliable. That has not been established to the satisfaction of the readers of this thread. You may be happy and confident with it but we've gone through more than 1,000 messages back and forth and convincing is not occurring.

You have asked for questions about your method. We have made them and often the responses, if they are made, are not clear and convincing. I'm willing to listen but this process has not gone well.

James

158Keeline
mayo 8, 2022, 2:48 pm

>156 faktorovich:

I assumed you all know what "filmographic" and "memory" mean separately, or that you could have looked up these terms, and then derived what the two of them when used together would have meant.

No, that's not a safe bet. Neologisms should be defined, especially since you believe you have invented it.

If I were to go on the first five or so pages of responses on the Google search, I would be led to believe that you are experiencing a condition associated with Asperger's, a form of high-functioning autism. I don't think that is what you are trying to convey with this usage and that is why I asked. Instead you are responding with a "you and everyone knew what I meant by this term I invented" response.

My request is that you should be specific. Be clear. Be concise. Be accurate.

Probably I am not successful at all of those and that leads to misreadings and incorrect conclusions of what I have written.

James

159faktorovich
mayo 8, 2022, 2:50 pm

>151 anglemark: Glitches such as counting "you'd" as a single word are applied consistently when the compared texts are all in Early Modern English, and thus include a similar percentage of these types of contractions. This only becomes a problem when Modern and Early texts are compared to each other, and then they register with very different outputs via Analyze, as I have explained previously.

Why isn't Analyze a more precise tool? Can you cite a paid-for program that does not generate some glitches, or that can recognize all variants of Early Modern vocabulary and distinguish between "you'd" and "you"+typo-letter? Analyze is one among a few of the best tools I have come across (as I use a few different tools that I list in Volumes 1-2) and it happens to be free. I did not receive any funding for any of my research, so it would be strange if the best programs of this sort received any significant funding; instead the worst services are likely to be the most efficient at manipulating their way into funding. I demonstrated previously the various extreme glitches that Stylo generates that makes even running a basic test without the system gargling up or adding typos to the text impossible. The best method is the one that achieves the central goal of a study in the most convenient and accurate way possible, and the tools I have used achieve accurate attribution conclusions.

160faktorovich
mayo 8, 2022, 3:05 pm

>152 anglemark: I see, all of you have been selling this CLAWS tagging system as superior. But you state that it is designed for "Modern English" texts, so it does not solve any of the problems with Early Modern texts using variant spellings etc. You also do not state if the "Shakespeare First Folio" tested was in Early or Modern English; if in the latter, it does not solve the general Early spelling glitches that you guys have been discussing. I have not seen any of you point to specific tagging errors in Analyze, as instead you have all pointed to how there have to be spelling errors or other glitches in a text for Analyze to misattribute a word to the wrong word-type category. Just because somebody wrote a bogus study that makes their tool sound superior to other tools does not mean their tool actually identifies more Standard-English-spelled common nouns than a rival free service.

I tested CLAW just to see if it is usable. The output is a file that places tags next to words using this abbreviation system: https://ucrel.lancs.ac.uk/claws5tags.html Because there are around 50 different tags, a researcher would have to manually add up each tag and then combine tag types to receive even a simple measure like percentage of nouns. Only by writing a program would I be able to perform the step of this calculation automatically, so it is designed for somebody who is writing a program and not for a statistician that just wants to apply an automated tool to determine linguistic measures that can then be compared and analyzed. VARD also appears to tag different words with different colors, but does not even appear to give data that calculates the number of different word-types in the text. I did not download this program to find out, since I assume it is similarly limited as is CLAW. I would absolutely use VARD or any other program that could calculate for free the number of nouns etc. in a text, especially if it increased precision. But based on what I have seen these are not better tools.

161Petroglyph
mayo 8, 2022, 3:09 pm

>154 anglemark:

That is strange. Total lexical density: 7.14%. Zero in the first two sentences.

It appears that did is only counted as an auxiliary; and so is do. That makes sense for things like do-support for questions and negatives: "Did you ask her? I did not ask her; I do want to ask her". It does not for "You did it" and "I said that you would do it."

It looks like have is counted as an auxiliary as well, all of the time. In "have your cake and eat it", it's counted as an auxiliary (though it should be main verb); and in "have you done it", it's correctly counted as Aux. In "Have you any wool?" it's counted as aux (though it should be main verb).

"Hast thou any wool?" -- hast counts as main verb.

That made me wonder why analyzemywriting counted the spelling "haue" as lexical all of the time. What does it think that word maps to? Hate or something?

Based on the results for "have you met me?" and "haue you met me?", the website thinks haue is a noun.

Looks like when analyzemywriting.com analyzes the words in a text pasted into the website, it runs a comparison against a stored dictionary where every word is tagged with a single part of speech. There's some fuzzy mapping going on between input text and dictionary -- feare, goodnesse, vnderstanding are mapped to the right type of word. Is the software just replacing letters until it gets a match in the dictionary? And it chooses the option with fewest replacements?

Now, since passive voice in English is a specific conjunction of BE + participle, I wonder how AMW handles passives. How accurate are passives on this website? (That was already going to be a future lunch break experiment (tm), but I'm curious now).

I've sent an email to the website asking for details on how their tokenization and POS-tagging works. Will report back if they reply.

162faktorovich
mayo 8, 2022, 3:16 pm

>157 Keeline: You guys have tested Analyze's output, and concluded that the measures it provides fit the basic rules its manual describes. While it is strange how they count different types of verbs, they do apply their rules consistently. It would be impossible for any researcher to check every word in an open tagging system for accuracy when there are millions of words in the corpus. And doing such checking and correcting a miss-tagging for every instance of "you'd" would require not only an enormous amount of effort, but also avoiding adding new glitches when automatically correcting all of these and erroneously changing "you'don't" or the like. Unless you suspect that Analyze manipulates noun-counts by hacking into the system and altering results (which you can check by manually testing or testing with other tools the same texts I give Analyze measures for in the GitHub file), the "black-box" nature of the output they provide is not a problem. The data I need for attribution is the number of nouns, and not the specific nouns' tagging; the latter is in a way calculated in my most-frequent words test.

You are all obviously affiliated with the rival tools you are selling such as CLAWS, Stylo, and the general computational-linguistic attribution word-frequency-only method. Thus, it would be just as absurd to believe it is possible to convince any of you in the precision of my attribution method, as it would be to believe it is possible to convince a Tobacco administrator to post that he has come to the realization it is better for humanity if they just stop producing Tobacco.

163faktorovich
mayo 8, 2022, 3:23 pm

>161 Petroglyph: No you object that "feare" might be correctly identified as "fear"? As long as all such cases are tagged consistently, in a corpus with millions of words, the overall attribution decisions will not be negatively impacted. You can guess if "passive voice" might not be accurate, but until you actually prove this system makes specific errors in cases where any rival system could identify the item correctly, you are not raising any actual objections against using Analyze.

164Petroglyph
Editado: mayo 8, 2022, 3:50 pm

Quick note:

I am attaching so much importance to have being counted as an auxiliary only, because that does make a difference.

In present-day English, "have been, have seen, have gone, have arrived, have come" are normal, but "Have you any wool?" is not. It was in Early Modern English.

Have was used much more often as a main verb in Early Modern English than today. So if the website marks every occurrence of have as an auxiliary, that is a big mistake with lots of impact on the numbers for verbs and auxiliaries (and, therefore, the lexical density). All instances of haue are mapped onto a *noun*, for chrissakes! That, too will have an effect on counts of nouns, auxiliaries, and, therefore, lexical density.

(Early Modern English preferred the auxiliary be for verbs of motion: "they are arrived; she is come". It took a few centuries for have to completely take over).

165Keeline
mayo 8, 2022, 3:37 pm

>160 faktorovich:

I tested CLAW just to see if it is usable. The output is a file that places tags next to words using this abbreviation system: https://ucrel.lancs.ac.uk/claws5tags.html Because there are around 50 different tags, a researcher would have to manually add up each tag and then combine tag types to receive even a simple measure like percentage of nouns. Only by writing a program would I be able to perform the step of this calculation automatically, so it is designed for somebody who is writing a program and not for a statistician that just wants to apply an automated tool to determine linguistic measures that can then be compared and analyzed.

Being a programmer or having access to one is a basic requirement of this kind of work. Just as being well grounded in statistics is needed.

Depending on your computer system (Windows is harder, Mac/Linux easier on this), it is not too difficult to write scripts that count occurrences of a given tag. I'd have to see some sample output to try my hand at it. Right now I see just the list of tags. On the Mac or Linux command line I would look to standard tools like awk or sed and wc to get some of the counts.

Counting and other repetitive tasks is one of the things that computers are well adapted to handle. They just need to be told how you want it done in terms they can understand. This is programming.

The whole business of identifying parts of speech is very hard to get right. But in this kind of study it can be important if your authorship claim depends on verb and other POS usage relative to the whole work.

James

166Petroglyph
mayo 8, 2022, 3:41 pm

>163 faktorovich:

I'm not objecting to that. The fuzzy mapping does its job correctly sometimes, and that is good. But it turns out that the fuzzy mapping makes mistakes, too: the high-frequency verb and auxiliary haue gets mapped to a *noun*.

until you actually prove this system makes specific errors in cases where any rival system could identify the item correctly, you are not raising any actual objections against using Analyze

You're play-acting at being a scholar, and it shows again.

We're demonstrating that AMW has issues with identifying auxiliaries correctly. Since the English passive is a specific set of combinations of Aux and Participle, that means that the counts for passive voice may be inaccurate and unreliable, too. That should concern you. (But you're not a real scholar, so it doesn't.)

Also, it is not anyone's job to provide you with better tools. You're the one who selected a free tool that spits out answers you do not understand but that are easy to (ab)use. If it can be demonstrated that this analyzer of *present-day English* is inadequate for analyzing Early Modern English, well, the only one impacted is you.

If AMW is shown to be inadequate, then that simply means that you cannot base a massive re-attribution on its results that flies in the face of everything we know about the period, the people, the texts, and how linguistics works.

167Petroglyph
mayo 8, 2022, 3:49 pm

>162 faktorovich:

It would be impossible for any researcher to check every word in an open tagging system for accuracy when there are millions of words in the corpus

That's why you work with taggers that have proven accuracy rates of +95%. Duh!

Your manual workflow and your limited imagination are showing again. Why not apply this obsessive need to be able to check everything by hand yourself to the black box that is AMW?

Stylo offers various options for tokenization out-of-the-box: you can choose whether things like don't is counted as a single word/token, or as a combination of do + not. The effort involved (in the GUI) amounts to checking a box; on the command line, you change the argument corpus.lang = "English.all" to something else.

The data I need for attribution is the number of nouns, and not the specific nouns' tagging; the latter is in a way calculated in my most-frequent words test

AMW counts haue as a noun. Surely that's not right? Surely such a high-frequency word throws off the counts in a way that matters?

168Petroglyph
mayo 8, 2022, 4:42 pm

A quick test for passives:

"It is done. It gets done. It has been done."

The middle sentence is not recognized as a passive. The other two are. Looks like the website looks for the combination of a form of BE with a participle.

The image behind that link also shows that the website says: "Identify where you might be using passive voice". It also says "about 2 out of 3, or 66.67% of your sentences look like they might be passive". And the sentences that are flagged as passives are explicitly flagged as potential passives: they all read "Check for passive".

So. The website is telling you not to just accept the total number of passives at face value -- it tells you that they *might* be passives, and it says to check if they are.

Wanna bet that Faktorovich just copy/pastes the total number?

I quickly tested that one paragraph from that sermon again -- in the original spelling. The site correctly identified three passives, but also this thing:

Look at the words marked in blue: the website is not looking for a form of BE and a participle (it does not use that as one of its parts of speech). Instead, it looks like a form of BE followed by something that *looks like* a participle (ends in -ed).

It is easy to trick such easy apps into giving wrong answers (try "They were bound for Austria. I am prepared to die" -- both are incorrectly marked as potential passives). But the fact that a tagger doesn't get everything 100% correct is not a real objection.

This, however, may be, I think:

While the blue words indeed constitute a passive, there are many more verb phrases/clauses in that passage, and the software only counts it as one. And therefore, all of this passage gets counted as passive.

Same story here:

If one verb phrase in a sentence is passive, the entire sentence is counted as passive. One sentence with multiple passives only counts as one passive. Not sure if that is the most useful level of granularity to look at passives. At least it is used consistently?

Preliminary conclusion: passives are kinda iffy on AMW. It's not as big a failure as Faktorovich's handling of lexical density (or so, so, so much else in that vast mess of misunderstandings and critical failures that is her "method"). AMW not being the greatest at identifying passives isn't that much of a critical error, either. The biggest problem I have is that Faktorovich uses as "% of passive sentences in the entire text" a measure that is clearly marked as the percentage of potential passives.

In other words: it's not so much the small shortcomings in the software that bother me. It's the careless overgeneralization of the person using the software.

169Petroglyph
mayo 8, 2022, 4:44 pm

>152 anglemark: Looks like an interesting study. Perhaps I'll have some time next week to look at it. Thanks!

170Keeline
Editado: mayo 8, 2022, 4:56 pm

>162 faktorovich:

Perhaps there are millions of words but how many different words ? That should only be in the thousands.

I'm not sure why you have addressed this response to me. I have not used these tools and I am certainly not shilling for them. But I do want to know exactly what is being measured and how this is achieved. I have no good reason to trust "black box" analysis. That is like saying:

"This is what it says. TRUST ME."

Well, we are not there yet.

Something is strange about the CLAWS page link provided since the many links I have tried don't go anywhere. The filesystem on the server must be rearranged. Backing up, I went to the Wikipedia page for CLAWS

en.wikipedia.org/wiki/CLAWS_(linguistics)

From there I found the project's website:

https://ucrel.lancs.ac.uk/claws/

and their FREE web tool which has similar terms of service to AnalyzeMyWriting.com. There are licenses for universities and people who want to integrate their program into another application being written.

http://ucrel-api.lancaster.ac.uk/claws/free.html

I took a text I had handy, Tom Swift and His Photo Telephone (1914) and selected CLAWS5 vertical output. Here are a couple sentences of output:


0003420 001 ----------------------------------------------------
0003420 010 " 00 PUQ 
0003420 020 It 00 PNP 
0003420 030 certainly 00 AV0 
0003420 040 is 00 VBZ 
0003420 050 , 00 PUN 
0003420 060 " 00 PUQ 
0003420 070 agreed 00 VVD 
0003420 080 Tom 00 NP0 
0003420 090 . 00 SENT 
0003420 100 ----- 00 PUN 
0003421 001 ----------------------------------------------------
0003421 010 " 00 PUQ 
0003421 020 And 00 CJC 
0003421 030 the 00 AT0 
0003421 040 next 00 ORD 
0003421 050 time 00 NN1 
0003421 060 you 00 PNP 
0003421 070 go 00 VVB 
0003421 080 for 00 PRP 
0003421 090 a 00 AT0 
0003421 100 conference 00 NN1 
0003421 110 with 00 PRP 
0003421 120 such 00 DT0 
0003421 130 men 00 NN2 
0003421 140 as 00 CJS 
0003421 150 Peters 00 VVZ 
0003421 160 , 00 PUN 
0003421 170 look 00 VVB 
0003421 180 out 00 AVP 
0003421 190 for 00 PRP 
0003421 200 airships 00 NN2 
0003421 210 . 00 PUN 
0003421 220 " 00 SENT 
0003421 230 ----- 00 PUN 
0003422 001 ----------------------------------------------------

I see that LT Talk does not use the <pre> tag in a conventional way that preserves spaces. Here's a graphic that shows the output in proper columnar format.

With text output like this, it would be possible to determine counts for each of the 62 (I let the computer count them for me) tag types in CLAWS5. There are 137 tag types in CLAWS7.

I saved my output as a text file and ran this sequence of piped commands on the Mac command line (same would work in Linux/Unix).

cat TS17-CLAWS.txt | awk '{ print $NF }' | sort | uniq -c

I did an extra step to make it into columns so it is not too hard to read here:

This shows the count of each type used. I could start with a list of the available tags and get counts even if 0 were used.

If there's interest I can explain the steps in the command above but this isn't Stack Exchange and I don't know if there is interest.

You are all obviously affiliated with the rival tools you are selling such as CLAWS, Stylo, and the general computational-linguistic attribution word-frequency-only method.

Please stop with the sweeping and inaccurate accusations. If the tools are FREE for CLAWS and Stylo, there is no selling. They are established tools that people in the field know and trust. CLAWS was developed in the 1980s and has had refinements since then.

James

171anglemark
mayo 8, 2022, 4:56 pm

>161 Petroglyph: Yep, that's what I got as well, so it's not just me then. It's probably inevitable that some instances of do as a main verb are mistagged, but it does like they have programmed the AMW tagger to always treat the primary auxiliaries do, have, and be as auxiliaries, even when they function as main verbs. In "You did it" there isn't really any room for misinterpreting "did" as anything other than a main verb! (I tried some sentences with be as a copula and it consistently tags it as a function word.)

As for why it interprets haue and thy as nouns, it's pretty weird, but I suspect you are right – the tagger just matches words to a dictionary, instead of applying rules about English syntax. A normal part of training an automated POS tagger would include which parts of speech can appear in a particular slot.

Thanks for emailing them about it, and about the tokenization! I'll be very interested to hear what they reply.

-Linnéa

172Petroglyph
mayo 8, 2022, 5:07 pm

>151 anglemark:

do you deduce their analysis from the POS percentages?

I do. I tried these out in small phrases and on their own.

just the pretty coloured bars and percentages. And that's more than a little problematic if it's going to be used as the basis of an analysis of the proportions of each part of speech!

You're quite right there! There's no other colours in the sentences at the bottom other than "lexical". That's insufficient.

The main issue with using AMW for this kind of research is, as you rightly point out, that the software was not developed with that kind of thing in mind. And so everything else follows from there (the results not being precise enough; the black-box approach; the bare-bones POS tagging; the lack of more specific colours in the sentences at the bottom; the rigidity of the spelling requirements; not being able to add non-standardly-spelled words along with their expected POS; most of the tests not being suitable for authorship attribution ...). The uses to which this very simple tool is put are unreasonable.

I ran four versions of the "Once more unto the breach" speech from Henry V through the lexical density

Nice!

173Petroglyph
mayo 8, 2022, 5:22 pm

>153 faktorovich:

I already tested the "lexical density" measure output on 284 texts just in the Renaissance corpus, so there is no need to go beyond it to answer the question of how genre impacts it.

Given that your results are complete garbage, there is no need to trust them. Your study is not the be-all and end-all of anything to do with genre -- you basically completely ignore all of it.

"Lexical density, as measured by this website, tends to be higher in informative writing than in fiction on the order of about 7%." This points to a major problem of assuming one knows what the results are going to be before starting an attribution experiment

That quote you repeated was the result of a test. Not an assumed conclusion, which is your modus operandi.

While the specific texts they chose to test between non-fiction websites

Wikipedia articles and short stories. Are you saying that short stories are non-fiction? Or did you just read carelessly? Or do you simply not care, in your rush to sound authoritative and play the role of a scholar?

One problem is that "pronouns" can be used instead of "nouns", so they should be counted as "lexical words".

No. Wrong. Such a basic misunderstanding that I'm not going to explain it to you.

And "was" is technically a verb, so even if it is a "linking verb", it should not be counted as a non-verb if the definition states all verbs are "lexical"

Auxiliaries are a fairly constrained subset of verbs. How is it you do not know this?

My statement in the book about this measure holds as true

Nope. Still wrong. But that sentence, taken in isolation, sure does sound like something a real scholar would say.

lovèd ... unreadable for programs characters such as the è in "lovèd" {...} this typo in the original transcribed text.

Ahem. It's not a typo.

174faktorovich
mayo 8, 2022, 8:52 pm

>164 Petroglyph: Having translated 15 different Renaissance texts by now, I can report that the use of "have" as the "main verb" was not uniquely more common than it is today. And even if it was, if this was a pattern across all Renaissance texts, it would be marked in the different category for all of them, and thus this consistency would nullify the significance of this divergence for all texts in the corpus with the original-spelling. On a brief search in "Restitution" I found "have" in "have been", "have had", "will have", "would have" etc. I stopped at this point, as there are no instances so far of "have" without a main verb next to it. Yes, the bigger problem is that "have" was frequently spelled as "haue", but it was thus spelled across most of the old-spelling texts I researched, so if it was always mapped as a "noun", it would also be on-average cancelled out as white-noise because the glitch occurs in all old-spelling texts. "Early Modern English preferred the auxiliary be for verbs of motion: 'they are arrived; she is come'. It took a few centuries for have to completely take over". I have no idea where you found this idea, but in all the texts I have translated I do not recall these types of odd phrasings being common. You have to cite specific examples of a text with many such usages, or the source where you found this conclusion, so readers can check how or if what you are saying is relevant.

175faktorovich
mayo 8, 2022, 8:59 pm

>165 Keeline: If I start counting the occurrences of these tags with a computer program, it would be more efficient for me to write a program for the entire set of 27-tests, vs. investing time in building a program and having it only be applicable to repeat one (or a few) of the step that already works if I just use Analyze's free software (and then having to repeat the other steps manually). It is amazing how much time you are investing in criticizing my failure to have written a program to make my 27-tests method extremely easy for you to use, and telling me how easy it would be for you to create the perfect program that would have applied such tests flawlessly. Meanwhile, I have to invest years into translating texts like "Restitution" to find intricate evidence to further support my quantitative conclusions, having explained in Volumes 1-2 that they have already provided extremely precise and verified results. There is no need for me to write a program because the methodology works by applying tools that have already been pre-programmed. The wheel has already been invented, it would be a waste of time to re-invent it, instead of using that wheel to build a new type of vehicle.

176faktorovich
mayo 8, 2022, 9:04 pm

>167 Petroglyph: Stylo cannot even manage a basic pre-programmed experiment without destroying the data fed into it, and it can only accept files that are posted on websites, and not uploaded into its system. Arguing that Stylo is more precise because Stylo or one of these other programs rated itself as being "95%" accurate is like saying that we should all eat Company X's fast-food because their advertisement said it was "the best".

177faktorovich
mayo 8, 2022, 9:20 pm

>168 Petroglyph: Yet again, the main goal for an attribution test is that it consistently applies the same set of rules/ tags etc. to all of the texts fed into it. Even if a sentence is not in fact "passive" and only "looks like it might be passive"; the Analyze program's rules are going to always identify as "looks like passive" the same types of sentences. Thus, while a grammarian who checks the sentences might find glitches in this system, the same grammarian would have to agree that the same types of glitches occur in the same types of situations; thus the grammarian would state that statistically speaking this measure would be just as a good at identifying patterns of usage, even if its definition of "passive" does not precisely match the grammarian's. The 27-tests attribution method is not a grammar-checker; it is a system to identify patterns of linguistic usage that are similar between texts written by the same author; however one defines "passive" (and it can even be defined as the exact opposite of "passive" by a program) the attribution method will be accurate if it repeatedly tests for this specific measure in a similar manner across all texts and gives answers that are consistent with whatever definition it is using. The accuracy of the "passivity" test can be verified by looking at the data; and you can check this accuracy by going into my 284 texts table re-ordering the "Passive" column by least-to-highest numbers, and you'll see that Percy's texts make up most of the bottom end of this range, and Verstegan's texts cluster on the upper-range. If this test had failed to establish divergence between signatures, the results would have been random at both ends, or with equal mixtures of all or most of the six potential ghostwriters. To check how this works on the large scale of a 30,000 word text, you would have to check all cases of miss-identified passive voice, and check such cases in different texts from different bylines, and then see if there is a similar percentage of miss-identified instances of passive voice that basically cancel themselves out. I can see that this is the case in the data; if the data failed to show divergence, I would have chosen a different test. It does accurately show divergence, so there is no need to fix this test. Yes, in an ideal world a better program can be devised, but apparently specialist programmers in this field are busy finding fault with other programmers' work.

178prosfilaes
mayo 8, 2022, 9:20 pm

>147 faktorovich: The joke is that a Ghost cannot write,

That is extra-linguistic. It is not universally believed; as an aspect of God, the Holy Ghost certainly can write, and many people have recorded what the people from the other side have said. Whether or not you or I believe that happens doesn't really have an impact on what people using the words mean.

I don't think you have much respect for these author's religions. The fact that someone would credit the Holy Ghost then just like many people would credit God now is not surprising to me.

The credit line that a text was written by "Nashe's Ghost" is not at all convoluted, but is a direct satire about "writers" who do not really do their own writing (perhaps because they are dead, or perhaps because they are illiterate) and instead have a "Ghost" (i.e. ghostwriter) write their book(s) for them.

It was credited to "Tom Nash his ghost", and is generally believed to be written by John Taylor. No need to talk about satires on fake authors; it seems quite clear to me that no one at the time believed the author was Tom Nash, and that the author was merely adopting a name to write under, like someone might write as Ronald Reagan's ghost or Abraham Lincoln's ghost to make a point about how those figures might view today.

It is not "extra-linguistic", but basic linguistics to grasp that the Workshop was repeatedly confessing to their own authorship by joking about the possibility of Ghosts (instead of them) doing the work.

The only linguistic information from "Title by Name" is that it is asserting that Title is by Name, even if that Name is "the Holy Ghost". Everything you're saying is extra-linguistic.

179faktorovich
mayo 8, 2022, 9:31 pm

>170 Keeline: The main problem you missed is that you have created 62 different tags for different word-groups. Checking all of these groups separately against a complex corpus such as the Renaissance is more likely to generate glitches than when testing only major groups such as nouns and verbs. While the test for "passive voice" or for "nouns" in Analyze might generate a few glitches from misspelled words, the same errors would also appear in CLAWS, and they would be magnified because you will be using 62 different tests instead of the fewer count that I include in the word-type category in my 27-tests. You have to simplify these results by adding a calculation step that groups these types. Or if you use all of these 62 tests, you are going to have to spend even more time in the later stages of my attribution method, where you calculate and add up proximity results. If you keep advertising tools like CLAWS and Stylo as superior to the specific tools I am using like Analyze; then it is a fact that you are advertising for them; you have not disclosed an affiliation with them, but this is the only logical reason you would keep insisting their tools are the only ones allowed to be used for computational-linguistic attribution.

180faktorovich
mayo 8, 2022, 9:40 pm

>173 Petroglyph: My results are exactly correct, as I have confirmed in finding evidence to support them across borrowings and various other pieces of evidence in 17 volumes of BRRAM so far. You have barely read the few paragraphs from Volumes 1-2 that you have quoted from. Whenever you read other parts of BRRAM, you attempt to take credit for what I explain in BRRAM as if it is your own ideas, and you did not find out about them from my explanations in BRRAM.

"Wikipedia articles and short stories": this statement is clearly creating a duality or stating the two things are different from each other, and thus one is non-fiction and the second is fiction. You are finding fault with absolutely true statements to avoid responding to my actual criticisms of the flaws in your suggestions. Across the following comments you are just saying false nonsense that is not responding to anything I actually stated.

181faktorovich
mayo 8, 2022, 9:52 pm

>178 prosfilaes: It is "extra-linguistic" to seriously argue that the "Holy Ghost" has been proven to be able to "write", and not to argue the opposite; the "Holy Ghost" is a fiction, and it is rational to interpret authorship by the "Holy Ghost" as an absurd or satirical statement, and not as a factual one.

Earlier today, I translated this section from "Restitution":

---The ancient pagan Germans, especially the noblemen, as both Krantz and other writers testify, sometimes took the names of beasts, as one would be called a Lion, another a Bear, another a Wolf, and so in a like manner these two before named princes (referring to "Hengist", whose name is derived from "Horse" and "Horsa", whose name is also derived from "Horse") had their denominations...---

Here is my annotation for this section (Verstegan cites Krantz by name later on):

---Albert Krantz’s "Chronicle" and "Saxony" both include names such as Leo dux Belgicae (Lion, the Duke of Belgium), Lupus Vasconiae dux (Wolf the Basque Duke) and Lupus Karoli (Wolf Charles). Verstegan might have especially noticed a line such as this one in "Chronicle": “…Ursus, alius Leo, alius etiam Catulus dicetertur.”, Latin for, “…Bear, alias for Lion, alias for Cub to be named.” This is an example of Verstegan hinting at his multi-bylined ghostwriting under clearly fictitious pseudonyms (such as “A Monday”) as well as under historical figures' bylines, such as “James I”; he is stressing that the use of such aliases in history books suggests the author is writing clearly fictitious names because the actual names of the historical figures discussed might have been lost to history.

There are thousands of similar references and discussions about aliases and pseudonyms etc. across the Renaissance texts I am studying. Lines such as "Nashe's Ghost" or references to pseudonyms are not the exception, but occur at some point in most of the texts I have translated.

182prosfilaes
mayo 8, 2022, 10:30 pm

>181 faktorovich: It is "extra-linguistic" to seriously argue that the "Holy Ghost" has been proven to be able to "write", and not to argue the opposite

Do you know what extra-linguistic means? Things like the existence of the Holy Ghost are factual questions, outside the field of linguistics.

the "Holy Ghost" is a fiction, and it is rational to interpret authorship by the "Holy Ghost" as an absurd or satirical statement, and not as a factual one.

So you don't believe it, so they didn't believe it? If it was accepted to be placed on the title page, it was because the readers did not interpret it as an absurd or satirical statement. And if the average reader would not have interpreted it as satire, why would you assume the author intended it as satire?

There are thousands of similar references and discussions about aliases and pseudonyms etc. across the Renaissance texts I am studying.

Sure, if you interpret the statement about "The ancient Germans" as a claim about something entirely different. I think this proves that Verstegan was a furry, and I'm sure I can find thousands of similar references proving that the British were all about furrydom at the time.

183Keeline
mayo 8, 2022, 10:48 pm

>179 faktorovich:

Your continuing accusations of collusion by me or anyone else to these publications is not only false but growing to be tiresome. If you have concerns then there are professionals with whom you can consult for this kind of issue.

I have not even heard of Stylo for R or CLAWS before reading about them in this group.

It is CLAWS v5 which has 62 parts of speech tags. If you look at and read them, you will find that they have a level of granularity that your favorite web tool cannot approach.

But if you want to aggregate all of the nouns together for your counts. For example:

NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)

This can be achieved with an egrep statement:

egrep "NN0|NN1|NN2|NP0" input_file.txt | ...

The 40+ year track record of CLAWS as a trusted tool fo identifying parts of speech with a very high degree of accuracy is why professionals in the field turn to it.

Since the programs are free to use, there is no benefit or motivation to sell anything. They have their reputation and 40 years is longer than most computer tools. Sometimes I think you reject anything that is unfamiliar or requires a little work on your part.

You claim that I've spent lots of time with the other tools is also incorrect. The actual work with CLAWS to paste in a text, generate a report, and use my experience with the tools available in the Unix/Linux command line to tabulate the results in a meaningful way adds up to about 15 minutes. It took longer to write about it than to do it. I started with BSD Unix in 1985 and have worked with Linux continuously since 2000. What I don't know offhand, I know how to look up. But the most challenging part was getting the results to display in columns so they didn't become a 62-line list of numbers and tags. Even that was a matter of making a Google search and consulting 5 pages until I found a solution that would work for me.

I don't know what you mean by proximity results. However, I would note that the numbers on the left of the CLAWS output have the sentence and word number so one could certainly work with that.

My main point was to illustrate that your metaphorical "throwing up of your hands" at anything that is not familiar can be resolved with just a little linking of one utility program to another.

Is it your goal to insult me until I go away? Is this how a professional scholar behaves?

Don't let yourself be characterized by the old adage:

When the only tool in your toolbox is a hammer, everything looks like a nail.

Instead I suggest:

There are more things in heaven and Earth, Horatio,
Than are dreamt of in your philosophy.

Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime

James

184prosfilaes
mayo 9, 2022, 12:34 am

>176 faktorovich: Stylo cannot even manage a basic pre-programmed experiment without destroying the data fed into it,

The fact that you don't know how to use a tool doesn't mean it doesn't work.

>177 faktorovich: If this test had failed to establish divergence between signatures, the results would have been random at both ends, or with equal mixtures of all or most of the six potential ghostwriters.

You're using your conclusion to prove a test that the conclusion is based upon. That's wrong. For this entire thread, other people and I have been telling you that you need to calibrate on a known corpus; you can't prove that something separates works by different authors by running it and deciding whether you like how it separates works.

Yes, in an ideal world a better program can be devised, but apparently specialist programmers in this field are busy finding fault with other programmers' work.

Says the person who is dismissing every scholar's work on the authors of the British Renaissance and who hasn't bothered to study existing programs.

185andyl
mayo 9, 2022, 5:07 am

>184 prosfilaes:

I thought you would pick up on that and "it can only accept files that are posted on websites, and not uploaded into its system>"

Which of course is pure rubbish. Anyone can establish this in under a minute if they wish. R/Stylo works from a corpus subdirectory* on your computer - there is no posting on websites, no uploading. You just copy your prepared file into the appropriate subdirectory.

Of course petroglyph got there first and explained this on post 251 of the original thread.

* OK you can load the corpus from whatever directory you want, but it is a subdirectory called corpus by default.

186anglemark
Editado: mayo 9, 2022, 11:07 am

>159 faktorovich:

Glitches such as counting "you'd" as a single word are applied consistently when the compared texts are all in Early Modern English, and thus include a similar percentage of these types of contractions.

But it is not just a matter of how many words a text contains nor what the proportion of contractions is in a text, though that is part of it. I looked at the first two texts you list in your bibliography, accessing them at the sources you specified that you had used: The Map and Description of New-England by William Alexander, 1st Earl of Stirling, and "A Pleasant Commodie called Looke About you" by an anonymous author, published in the collection A Select Collection of Old English Plays, Vol. VII (4th edition), 1876.

Before getting to the issue of contracted forms, a different issue: The Bodleian online edition of The Map and Description contains a lot of invisible characters. How many of your texts were affected by that, and how did you fix them before you ran the text through AMW? AMW also removes characters such as the macron over ē in words such as "Patēt" and "cēsured", splitting the words that contain macrons – so "patent" and "censured" become "pat t" and "c sured". How do you handle that type of issue, and how many of your texts are affected by it? (Incidentally, I see that it does the opposite with hyphens, removing them and combining the words.)

As for the contractions, The Map and Description apparently contains no contracted forms at all. "A Pleasant Commodie", on the other hand contains a lot of them, e.g.

"ROB. You see I am weapon'd, do not, I beseech thee.
I'll stab them, come there twenty, ere they breech me."

"POR. I wonder how thou cam'st so strangely chang'd!"

But there are also many potential contractions that are written out in full, so on subsequent lines we have

"SKINK. I'll bring it ye to the sheriff's, excuse my absence.
FAU. I will, my noble lord; adieu, sweet prince."

Since this is a play written in rhymed blank verse, scansion was presumably a consideration, but the reason for the variation is less relevant than the fact that there is considerable variation within the same text. We have

"Had you not been, I would have left the place;
My service merits not so much disgrace."

and also

"My lord, but that my case is desperate,
I'd see your eyes out, ere I would be cheated."

Apart from the number of words in the sentences, you have a more serious problem with your indata, which will affect your outdata a lot. "I would" is correctly counted as one pronoun + one auxiliary in AMW. "I'd" is incorrectly counted as one auxiliary. "exil'd" is incorrectly counted as an auxiliary instead of a main verb. "red-cheek'd" is incorrectly counted as an auxiliary instead of an adjective. If the frequency of certain parts of speech are some of your "tests" that allegedly indicate authorship, how do you control for the fact that the tool you use doesn't provide you with the correct proportion of main verbs, auxiliary verbs, adjectives, and pronouns? (This is why tokenization is such a crucial thing.) Just for the hell of it I ran the text of "A Pleasant Commodie" through AMW twice – I did not clean it up, so in both runs, the character labels (the name of the character who speaks on stage) were left in, as were page numbers, stage directions, and other chaff.

Run a) was the text as it is from the source and run b) had all instances of "I'll" expanded to "I will", and all instances of "you'll" expanded to "you will".
Run a) had 26482 words with 10.85% pronouns and 6.75% auxiliaries, run b) had 26708 words with 11.6% pronouns and 5.84% auxiliaries.
What, if anything, does that kind of difference do to your results? (Keeping in mind that this was just the expansion of one single contracted auxiliary with two individual pronouns, in one individual text.)

-Linnéa

187anglemark
mayo 9, 2022, 12:06 pm

>174 faktorovich: , responding to >164 Petroglyph:

Having translated 15 different Renaissance texts by now, I can report that the use of "have" as the "main verb" was not uniquely more common than it is today.

To make that kind of claim you'd need something more than vague assertions based on anecdotal evidence. Why the scare quotes around "main verb"?

"Early Modern English preferred the auxiliary be for verbs of motion: 'they are arrived; she is come'. It took a few centuries for have to completely take over". I have no idea where you found this idea, but in all the texts I have translated I do not recall these types of odd phrasings being common.

See https://www.librarything.com/topic/337240#7814046. That be and have have been used in (constrained) variation as auxiliaries to form the present perfect and past perfect tenses in English is a very well-researched phenomenon. Again, you can't go on what you think you remember from texts you have read – not to mention the fact that you already commented on the construction (in your post 878 in the prev thread) and said that it wouldn't sound odd to anyone who knew English grammar.

To be honest it is a little depressing that you have spent years of your life working with EModEng texts, without being aware of this pattern of variation.

-Linnéa

188faktorovich
mayo 9, 2022, 12:31 pm

>182 prosfilaes: A fuller definition for "extra-linguistic" is "not included within the realm of language or linguistics"; and pretty much all subjects that require language to debate about them are within the realm of "language". Thus, there are many terms that apply to the existence or non-existence of the "Holy Ghost", but it being a subject that is "extra-linguistic" or outside of language is not a relative term. If you are applying the term to mean a general incorporeal theological question of the existence or belief in God versus the non-believers being deserving of dismissal... This conversation has strayed into the opposite from the intended satirical meanings that writing by the "Holy Ghost" was meant by the Workshop to express.

The biggest problem with previous author-attribution studies, and especially those about the Renaissance, is that bylines have all been taken seriously, or as if even an obviously satirically absurd byline like "A Monday" represents an actual author, even if most of these bylines are not supported by any documents that support the existence of a person by that name. If you lack a sense for when an author is using satire (such as exaggeration of a fictitious pseudonym into a blatantly fake one to make a sarcastic criticism) or irony (saying the opposite of the intended meaning); you cannot interpret the actual intended meaning. Without awareness of the intended meaning, a purely quantitative analysis of a corpus is going to lead to the broad misattributions that are now accepted. For example, in the "Shakespeare" byline so that some texts with "Shakespeare" in their original byline are now not accepted as having been written by "Shakespeare", while those that had another byline, or were anonymous are now broadly accepted as by "Shakespeare"; the acceptance of this byline in the first place stems in ignoring that this byline with modernized spelling is "Shake-spear"; and "John Shakespeare's" signature handwriting matches "William Shakespeare's". Thus, an absence of a sense of humor is one of the major obstacles attributors have had when handling this corpus. The "average reader" has absolutely failed to see the humor in these jokes, but this failure to appreciate the absurdity is the fault of scholars and professors who fail to notice it.

While it is possible to prove all sorts of things based on fragments of evidence. The fact that absurd pseudonyms were overused in the Renaissance is provable by statistically adding up the number of bylines with silly words in them, like "Kyd"/"Kid", "Donne"/"Done"; "Chapman" is a name that is explained as being absurd in one of Percy's plays, and Percy was the main ghostwriter under this byline. There is an entire section in "Restitution" where Verstegan explains how names can be created out of alternative spellings for common words. Your conclusion regarding "furries" cannot be proven factually because for starters there were no furry-like costumes that were worn in the Renaissance, and your argument requires proving what they did in their private lives. In contrast, the bylines and their meaning are explained by these Renaissance writers themselves in numerous cases, and they were printed on these books so their existence is already substantiated.

189andyl
mayo 9, 2022, 12:44 pm

>188 faktorovich: your conclusion regarding "furries" cannot be proven factually because for starters there were no furry-like costumes that were worn in the Renaissance

Do you have difficulty detecting sarcasm?

190faktorovich
mayo 9, 2022, 12:48 pm

>183 Keeline: Any tool that is 40 years old, and might not have been edited since it was created is most likely inferior to a tool that performs the same function that was created in the past decade, as computing has leaped forward. And Analyze actually gives the types of output data that my method requires (the statistical data on each individual test component), whereas the tools you have pitched require programming to perform additional steps before deriving the data points I need for my method. If you keep arguing for this ancient tool, you at least should stop objecting that you object to me objecting to you being biased towards it/them, and inserting advertisements for it/ them.

"Proximity results" are obtained in one of the steps in my method, as the proximity (degree of similarity or divergence) of each text to each text is calculated for each of the tests.

Your last quote is about the value of teaching things; don't you find that if this is your motto, you should have at least read some of my BRRAM series, or enough to learn what "proximity results" means? As I have repeatedly explained the things you guys are suggesting are extremely anti-helpful, and have no relevance to perfecting my already precisely functioning 27-tests computational-linguistic author-attribution method. Most of what you are saying are indeed insults. I am just trying to respond to your points to avoid readers misinterpreting my research from the false statements you are saying about it without understanding even a basic component of it like "proximity results". Why do you think I would want or need you to "go away"? If you guys keep saying false nonsense this should prove your position to be false for me, as you keep circling around the same falsehoods.

191faktorovich
mayo 9, 2022, 1:00 pm

>184 prosfilaes: It is a fact that I did use Stylo (as I described earlier in thread #1), and that's how I know it does not work because it is infested with bugs, and it is unusable for any practical linguistic experiment.

It is absolutely not true that I "need to calibrate on a known corpus", as "calibrate" means: "adjust (experimental results) to take external factors into account or to allow comparison with other data". Such adjustments only distort the data to make it fit with a researcher's desired results; or force the re-attribution of texts to the same bylines that they were previously assigned to, instead of establishing the true authorship groupings of the texts. Instead of "calibrating" or adjusting my results to fit with others' data; it is indeed accurate to check if an attribution test works by if the results it provides distinguish between texts in a meaningful pattern; and if the attributions match documentary research into if such attributions were possible, or if they are the most logical attribution conclusions. There are many other checks that I have performed to establish my results are correct, but at no point have I adjusted the experiment's results to fit any previous researcher's data. Doing so is both illogical and immoral.

I have already done a brief review of the existing linguistic programs, if I look any further it would be just to create a program that applies the 27-tests method simultaneously to a corpus.

192faktorovich
mayo 9, 2022, 1:03 pm

>185 andyl: Based on my previous testing, R/Stylo has a glitch that blocks usability of their subdirectories, as it looks like files are added to the correct subdirectory, but Stylo does not see these files, or produces numerous errors and only recognizes files when they linked to from websites. Yes we did discuss this programming error in Stylo in thread #1.

193andyl
mayo 9, 2022, 1:18 pm

>192 faktorovich:

Well it is very telling that the only one who cannot manage to get R/Stylo to do stuff is you. I am not a computational linguist nor am I an R user but I would bet any money that I would be able to download it, install the stylo package, and get it to analyse plain text documents which exist on my machine within 15 minutes.

I think you have amply demonstrated that your computer skills are lacking over the course of the entire discussion.

194faktorovich
mayo 9, 2022, 1:27 pm

>186 anglemark: For "Look About You" I tested the EEBO old-spelling edition, and not the reprint from the 19th century; I used the reprint to help with transcribing the text for the translation of this volume in BRRAM. I also did not use the Oxford edition for "Alexander", but instead used the EEBO old-spelling version of this text, as I did for most of the texts, as EEBO is the most accessible format. With just a few automated corrections for odd punctuation and lines over letters etc., it creates a smooth text without any significant glitches that prevent precise results. I used other sources only when EEBO was not available, or if I had initially tested a new-spelling edition, before adding an old-spelling version, or the like. You can see which edition I used by checking the "old/new" spelling column in the 284 texts data file.

I did not see any ""'censured' become... 'c sured'" glitches with the formats I used because I substituted all of the lines over e etc. glitches prior to running tests.

You might be comparing a modernized edition to an old-spelling edition. If you are not, and you notice more contractions in one of the texts; this is an obvious linguistic difference between two distinct authorial signatures. Even if the counting programs do not understand what the contracted words mean and misinterprets them; they apply the same type of misunderstanding to all texts and this creates a diversion in the output that reflects this over-use of contraction by a given authorial hand. Thus the attribution results are accurate, even if the counting program would have received a failing grade as a grammarian.

Similarly, assuming you are correct in this statement: "'red-cheek'd' is incorrectly counted as an auxiliary instead of an adjective." If it is always "incorrectly" counted and all words like it are always "incorrectly" counted as an "auxiliary"; then, when you look at the pure mathematical data, they are all correctly counted as a similar type of word (be it "adjective" or "auxiliary"). In a perfect world, I would have the time or the money to build the perfect program that precisely counts even an old-spelling test, but to reach perfection, this would be an enormous undertaking. And it is entirely unnecessary, as the simple free tests already reach precise results, and they can also be used by the general public with the no-programming-needed method I already described.

I did delete page numbers and most other irrelevant info, but not the names of characters, and some other repetitions that appeared in the original texts.

A percent change in the data with contractions subtracted is a change that probably would not change the attribution of a text, but it might if it also impacted other tests more significantly. I left contractions as-is to allow all of them be counted similarly (all generating glitches, or all being interpreted in the same systematic manner). After running this test, you should realize that it is not a good idea to turn contractions into words, especially not unless you intend to change all contractions in all texts, and thereby to clean up a likely significant style-indicator that is impacting the outputs (even if a grammarian might guess the contractions are not being counted as what they actually are).

195faktorovich
mayo 9, 2022, 1:34 pm

>187 anglemark: They are not "scare quotes"; they are quotes to cite your usage of the term.

The evidence is inside the 17 volumes of BRRAM; if you opened one of them, you would realize I am not making "vague assertions", but rather those in fact supported in these many volumes.

It is absurd to cite your own ramblings on LibraryThing to prove a grammatical rule about all texts in the Renaissance. If you believe you know more about Language X (having only read your own chats about it) than somebody who has been translating Language X for a couple of years as their daily job; then, you have an inflated sense of your superiority.

196faktorovich
mayo 9, 2022, 1:37 pm

>193 andyl: To repeat what you said: you have never used X before, but you are sure that X works perfectly, and that you can make it work perfectly, and also that I know nothing about either X or about computers (just because you imagine I do not).

197andyl
mayo 9, 2022, 2:06 pm

>196 faktorovich:

No - I think you have demonstrated your lack of knowledge. A complete lack of knowledge about character encoding, about Excel and CSV files, and more.

Why am I sure I can get it working so quickly.
1) Other people contributing have no problem - so there are no major breaking bugs. Literally millions of people use R. Fewer use stylo of course, but enough that I have confidence there are no major breaking bugs.
2) I have decades of experience of working with open source software (I guess I am of the same vintage as keeline in that I also started with BSD 4.2 in around 1985), and I am a software developer by trade.
3) I have had lots of experience in learning (through use) of many different types of software - some without any documentation at all.

Now obviously I would only get the basic analysis that stylo does in that initial run. But it is enough that it would prove my point that getting started with R/stylo isn't difficult.

198Keeline
Editado: mayo 9, 2022, 6:58 pm

>190 faktorovich:

Any tool that is 40 years old, and might not have been edited since it was created is most likely inferior to a tool that performs the same function that was created in the past decade, as computing has leaped forward.

Wrong again. What do they say about their tool (emphasis added)?

Our POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the British National Corpus (BNC).

You have shown that you are prone to snap judgements to dismiss anything with which you are not familiar. You could have looked at the link I provided where you'd see the above quote in the first line of the introduction. But clearly you did not do so.

Since you have shown that you have your own definitions for certain phrases that are not to be found in dictionaries or websites, I have decided it is easier to ask what you mean by something like "proximity results".

The normal use of "proximity" is something that is physically close to another. That is why I noted the sentence and word number information that is in the output of CLAWS with the vertical output.

I will leave it to others with more experience to state whether "proximity" is used as you do. It is not something I have encountered nor will I look it up at this time.

"false nonsense"? Who is the one who throws accusations that the participants in this forum are all on the payroll of the producers of free-to-use text analysis software like Stylo for R or CLAWS?

I want the insults and accusations to stop. It is not in keeping with the terms of service of LibraryThing. If you have nothing that you can possibly learn from anyone else, then there is little point in continuing this thread.

James

199Petroglyph
Editado: mayo 9, 2022, 3:08 pm

>187 anglemark:
>195 faktorovich:

From Nevalainen (2006, p. 94), a standard textbook:

{T}he perfective structure was well established in Early Modern English. One difference between the Present-day and Early Modern English constructions is that Early Modern English normally preferred the auxiliary be with verbs of motion (for example, arrive, come, depart, enter, fall, go, land, return, ride, run, sail, set) and change of state (for example, become, change, grow, melt, turn, wax ‘grow’). Come is illustrated in (10). The preference changed in the course of the Late Modern period, when have replaced be.

(10) Al thes are come (sayde he,) to see yow suffer deathe; there ys some here that ys come as farre as Lyengkecon {Lincoln}, but I truste ther commynge shal be yn vayne. (HC, Thomas Mowntayne, Narratives of the Days of the Reformation 1553: 203)

Nevalainen, Terttu. 2006. An Introduction to Early Modern English. Edinburgh Textbooks on the English Language. Edinburgh: Edinburgh University Press.

200Petroglyph
mayo 9, 2022, 3:02 pm

>182 prosfilaes:

"Your conclusion regarding "furries" cannot be proven factually because for starters there were no furry-like costumes that were worn in the Renaissance, and your argument requires proving what they did in their private lives."

Such a nice setup. I lol'd at that.

201Petroglyph
mayo 9, 2022, 3:06 pm

>188 faktorovich:
for starters there were no furry-like costumes that were worn in the Renaissance, and your argument requires proving what they did in their private lives.

You don't know that. People in the 1500s had easy access to pelts and furs. If you read between the lines, it may even go back all the way to the intimate friendship between Gilgamesh and Enkidu.

202lilithcat
mayo 9, 2022, 3:20 pm

"Exit, pursued by a bear": https://www.yalehistoricalreview.org/exit-pursued-by-a-bear/

203Petroglyph
mayo 9, 2022, 3:48 pm

>202 lilithcat:

Making the beast with two backs.

There's the "falling in love with a donkey" bit in Midsummer Night's Dream, as well.

We may be on to something here...

204andyl
mayo 9, 2022, 6:49 pm

>196 faktorovich: >197 andyl:

I have just done the experiment.

I installed R onto a linux server and installed stylo as a local library, and used it to analyse 3 texts using the defaults. It took me approx 10 mins 45 seconds - it would have been less but stylo compiles some bits of itself when installing.

In some senses my setup would have been less easy to use, and less effective, than the standard as I do not have X on the server so no nice Tk interface for me.

205faktorovich
mayo 9, 2022, 8:52 pm

>197 andyl: If there were no bugs in Stylo/R/other popular computational-linguistic attribution methods; then, the Renaissance would have been previously accurately attributed, instead of having new studies varying between introducing new re-attributions of isolated texts, and re-affirming ancient attributions that stem back to the Renaissance itself. The fact that the currently accepted bylines are ridiculously incorrect, as I explain across BRRAM, is the reason these methods have been proven not to work. The enormous number of bugs that make these tools unusable and uncheckable is the reason specialists in this field have been able to avoid being discovered manipulating results or providing fraudulent results that are designed to make the methods seem "accurate" because they re-affirm current bylines, and not because it actually establishes true authorship. The absence or insufficiency of coherent manuals or "documentation" is one of the ways this un-usability has avoided detection, as nobody can learn something without any guide as to how it works. This would be like learning a new language without a single dictionary or grammar book to describe its rules or definitions. Yet you are arguing that you can learn this undefined language and this makes you superior to anybody who can only learn things that are learnable. The only reason you can be making such a claim is either it is a language you have made up, and have failed to make a manual to allow anybody else to learn it; or you are simply making an untrue exaggeration designed to make yourself appear superior.

206faktorovich
mayo 9, 2022, 9:00 pm

>198 Keeline: How did they find 100 million words in an English language that at most has around 1 million words? You don't see anything suspicious about this claim? Software is only free to use if it can be used by itself without paying a computer programmer to finish it by adding tools that make its output usable in a practical linguistic experiment. I am making factual statements, with specific examples from your own statements across this thread. All of you are making false accusations that my method is "garbage" and that I am "prone to snap judgements", along with other insults you have tossed at me. My 27-tests method does work. I decided on this method after intricate research into the potential forms of computational-attribution (as I have explained previously, and document across my explanations in the BRRAM series). I have tested the largest corpus of texts and bylines ever tested from the British Renaissance. I am here to explain my method to all who are interested in it. If you are going to advertise your own method(s) as superior in a thread that is about my method and my findings; I am going to explain to readers any falsehoods or errors your bring into your insulting arguments tossed at me to drown out the significance of my findings.

207faktorovich
mayo 9, 2022, 9:06 pm

>199 Petroglyph: The Workshop texts I tested are mostly between 1570 and 1648, the text you cite and the unique feature you are mentioning appears to be from just before the Workshop's time. I previously gave these examples in "Restitution" where I found "have" in "have been", "have had", "will have", "would have" etc. All of these are the standard Modern applications of "have" and not this earlier variant you are referring to. Either way, any variant usages of "have" are not likely to make any significant impact on the 27-tests I applied to this corpus. Thus, such concerns are irrelevant to checking if an attribution method is accurate.

208faktorovich
mayo 9, 2022, 9:09 pm

>203 Petroglyph: I hope you will all spend the next 2 and 1/2 years pursuing this hypothesis, only to face a similar thread of insults as this thread has been.

209prosfilaes
mayo 10, 2022, 12:05 am

>191 faktorovich: It is absolutely not true that I "need to calibrate on a known corpus",

I really hope you never run a grocery store. "That's two grapes, and your scale says five pounds; you need to calibrate your scales!" "Such adjustments only distort the data; I know the scale is correct, because it is logically correct."

210Petroglyph
Editado: mayo 10, 2022, 1:37 am

>206 faktorovich:

Keeline, I apologize for butting in here.

HoW diD tHey FINd 100 MILlioN WOrds In an EngLISh LAnGuAge tHAT at mOST hAs AROuND 1 mILlIOn WORds? yOU doN'T See ANYthING suSpiCiOus ABOuT tHis cLaiM?

Faktorovich has jerked her knee and has produced more lies to smear people who are not Faktorovich.

A corpus of 100 million words does not contain that many unique words, obviously. It will contain millions of instances of the word the, a few tens of thousands of the word rather, and maybe a few thousand of the word neighbouring. 100 million is the number of tokens (all the occurrences of all the words), not the number of types (the unique words).

Take this smallish corpus of 100 English novels of the 19th and early 20thC of 14+ million words. It contains 14,349,539 tokens spread out over 80,636 types. The word the alone accounts for over 716,000 of those 14 million tokens.

Here is a screenshot of the 68th-82nd most frequent words:

This is the range that is still dominated by function words, but where high-frequency content words show up more and more (know, see, time, good).

The Corpus of Contemporary American English (COCA) contains 1 billion words (tokens). The British National Corpus (BNC) contains 100 million words (tokens) from the 1980s up to 1993. The open-source version of Early English Books Online (EEBO) has 755 million words (tokens), spread out over +25,000 texts. The Proceedings of the Old Bailey, a collection of just under 200,000 courtroom trials held at the Old Bailey in London between 1674 and 1913, contains 120 million words (tokens).

And these are fairly constrained corpora, aiming for one variant of English, or one period, or one coherent group of texts, with very strict admission criteria. Web-based news corpora are up to 15 billion words (tokens). And even these are not random crawlers harvesting texts here and there and adding them willy-nilly. The sources are carefully chosen and vetted.

How do I know Faktorovich is lying? She proudly touts the 7.8 million words (tokens) of her own corpus.

Any and all pretensions to presenting a corpus larger than 1 million words as suSpiCiOus end right there. It's flailing, unfocused lies and smears.

No-one who knows anything about corpus linguistics should seriously, knowingly produce a quote like the one above.

Faktorovich, Keeline is quite correct in stating that your methods are garbage (they are garbage from front to back, and so are your results). He is quite correct in stating that you are "prone to snap judgements" that, somehow, always affirm your opinions regardless of their connection to reality (which often goes through at least one filter of misunderstanding). You demonstrate those snap judgements (or "brisk impressions" as you call them) over and over again.

211Petroglyph
mayo 10, 2022, 1:30 am

>208 faktorovich:

Don't plagiarize our idea! The timestamps are right there in the post headers!

212Petroglyph
mayo 10, 2022, 1:35 am

>207 faktorovich:

Lies and deflection.

From a quick search through Romeo and Juliet:

My sword, I say. Old Montague is come
And flourishes his blade in spite of me.

the guests are come, supper
served up

Make haste. The bridegroom he is come already

213Keeline
mayo 10, 2022, 1:47 am

>206 faktorovich:

How did they find 100 million words in an English language that at most has around 1 million words? You don't see anything suspicious about this claim?

You must work pretty hard to overlook obvious statements. I'll do your work on this for you. Details on the corpus can be found here. I don't know why they don't have https (an SSL certificate to encrypt traffic) but my browser (Chrome) complains of this.

http://www.natcorp.ox.ac.uk/corpus/index.xml

The 100 million words is not 100 million different words but the number of words in the corpus.

Oxford Dictionary has 273,000 headwords; 171,476 of them being in current use, 47,156 being obsolete words and around 9,500 derivative words included as subentries.

https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words

Software is only free to use if it can be used by itself without paying a computer programmer to finish it by adding tools that make its output usable in a practical linguistic experiment.

OpenOffice is a free (free as in beer, free as in freedom because the source code is available to those who care to look at it) open-source software. It includes a word processor comparable to Microsoft Word, a spreadsheet like Excel, and a presentation program similar to PowerPoint and other tools. However, it will not write a novel for you nor even your BRRAM series. What kind of world should do everything for you for free without the slightest amount of work?

What I did to string one Unix utility to another with the pipe characters (|) does not qualify as programming. It is a tiny bit clever perhaps but it is just using the tools just as a furniture maker would use several tools (saw, planer, drill, screwdriver, sander, etc.) to make a piece of furniture. There is a traditional sequence that these tools are used and each requires a bit of skill to use them well. Here's what I wrote before:

cat TS17-CLAWS.txt | awk '{ print $NF }' | sort | uniq -c

cat filename — displays a text file line by line
awk '{ print $NF }' — gets the last word from each line (the parts of speech tag) and displays it
sort — performs a basic alphabetic sort, leaving duplicates intact
uniq -c — takes a sorted list and produces a list with a count of the number of occurrences (-c)

This is not magic. It is basic processing of a text file to extract what you want from it. It is on a par with writing an Excel formula to perform a tabulation or a conditional one if you want to make it a little more advanced.

my method is "garbage"

I have said no such thing. I have expressed doubts and asked questions. You asked for them.

Extraordinary claims require extraordinary evidence and persuasion.

You claim that you are rewriting the bulk of literary history with your findings. Yet, you don't think it is important to demonstrate that your method works for texts of 100% known authorship. It is beneath you, apparently, from your repeated vociferous reactions in this thread, including #191 most recently.

I have looked at more than a few scholarly articles about authorship attribution studies and a couple books on the topic. It is often suggested that beginners with a technique use known writing, such as their own, to get familiar with the results produced. Here is the advice from the authors of one form of authorship analysis:

Starting with Yourself: Making a Graph of Your Own Writing or Speaking

Before attempting to use ______________ for any specific problem or attribution, it is always a useful — indeed necessary — exercise to become familiar with how the technique works. If the method is to be used, the user needs to know exactly how a sample of utterance can be attributed, and how it can be differentiated from the utterance of other persons.

The first necessity is to obtain a sample of writing or speech whose origin is beyond dispute: you, the user, need to be sure of who wrote your sample, or uttered the spoken words of your sample since

the integrity of the text

is of first importance when making experimental use of _____________. The most obvious source of such a sample for all of us is ourselves: letters, an essay or obvious source of other written piece, a talk or ordinary recorded conversation, will provide suitable material.

I have left off the name of the method since I don't want to be accused (again) of shilling for someone's method that is not your own. Suffice it to say that this method has been around for 30 years and, while it has some detractors as all methods do, it is reliable and trusted enough to be admissible in disputed document analysis in British courts.

This is not the only case where the author(s) of a method advise getting familiar with the results of known texts but it is the example I will offer.

If someone devised an attachment for an automobile and they claimed that it displayed the speed of travel, would you not expect them to calibrate it against known units? If you bring home a barometer, it is often necessary to set it against a known unit. Likewise, would you want a physician's assistant to take your temperature with a thermometer that had not been tested and marked against a known standard?

I have also stated that a tenet of the scientific method is to devise experiments that isolate for single variables that affect the outcome.

They do not try to bury the results by throwing out data without a good reason. One such reason is that they are mindful of significant digits in the measurements and the resulting calculations. If a measurement has only two significant digits like 1.5 inches, it doesn't matter what other calculations you make, you can't justify more than two significant digits in the output. Engineers taking physics classes had this problem all the time. Their calculator showed 8 or 16 digits and yet their measurements were rarely beyond 3 significant digits.

But my opinion holds no weight with you because you do not see me as a peer. I'm just part of the rabble, nipping at your heels, apparently. There's nothing in what I write that can contribute to or add perspective. You have a conclusion at the start of your project and you seem to want to bury the reader with walls of text, repeating the same six names whom you claim wrote just about everything of importance in the British Renaissance, nowadays called the Early Modern period by scholars.

Why is it so important to unseat the authors? What is your purpose behind it? Are you like an Oxfordian who feels that the son of a glove-maker in Stratford could not have possibly had the education to write the plays and poems which have such high esteem in English literature? Or are you a Baconian who finds numerology and hidden messages buried in every text like the authors of The Bible Code and a tongue-in-cheek article a bookseller I know wrote called "The Bobbsey Code" in reaction to that book?

You seem steadfast that every study that points to a certain collection of results you derisively call "bylines" are contrived and staged to end up with a certain result. Yet, you should be mindful of this quote which can raise more doubts than followers to your claims and methods:

Every accusation is a confession

James

214Aquila
mayo 10, 2022, 3:10 am

>188 faktorovich: Your conclusion regarding "furries" cannot be proven factually because for starters there were no furry-like costumes that were worn in the Renaissance.

Nick Bottom would beg to differ.

215anglemark
mayo 10, 2022, 3:44 am

Keeline, thank you for your posts here, which I for one appreciate very much. In fact, I suspect there is only one participant in these discussions that doesn't appreciate them, but speaking for myself as a linguist who barely knows the rudiments of programming, I find it very helpful indeed to get the kind of pedagogical explanations you have provided in, for instance, >170 Keeline: and >183 Keeline:.

Unfortunately, I don't think it will help Faktorovich much if I point out that https://corpus-analysis.com/ , which you (Keeline) linked to in >148 Keeline:, lists several tools that will create readable output of a CLAWS-tagged text. Including AntConc, which was mentioned in the previous thread as well. I can't believe I wasn't aware of corpus-analysis.com; I have it bookmarked now.

As for "proximity results", not a term I'd ever heard used before. To be honest, I thought she was referring to collocations (sequences of words that co-occur more frequently than they would do if it were random.)

-Linnéa

216andyl
Editado: mayo 10, 2022, 4:09 am

>205 faktorovich:
Earlier you said "stylo cannot even manage a basic pre-programmed experiment without destroying the data fed into it, and it can only accept files that are posted on websites, and not uploaded into its system." and "Based on my previous testing, R/Stylo has a glitch that blocks usability of their subdirectories, as it looks like files are added to the correct subdirectory, but Stylo does not see these files, or produces numerous errors and only recognizes files when they linked to from websites."

To explain your lack of being able to use the program. Your statements are factually incorrect. Stylo saw my files with no problems, it did not produce errors, and my files have never been near any websites. It did not destroy my corpus.

Your original gripe was that it was unusable gargbage.

Now you are moving your objection to some set of bugs that doesn't find the correlation that you do. You also say The enormous number of bugs that make these tools unusable and uncheckable is the reason specialists in this field have been able to avoid being discovered manipulating results or providing fraudulent results

I think that is an extraordinary claim and you cannot back it up. The source code for stylo and for R are freely available. For example I cannot say whether a particular test is a good one or a bad one for a corpus (I don't have the knowledge or experience in the field) but being a software developer I can look at the code and assess its quality and tell if the particular function has been implemented poorly (with loads of bugs) or well. If you look at the source code you can see where bugs have been eliminated as people (computational linguists and developers and the original authors) have done this.

Of course that doesn't mean there aren't still some very subtle, very minor, bugs in the code but there are not major ones like you describe. Those get found out real easy and stomped on.

This would be like learning a new language without a single dictionary or grammar book to describe its rules or definition
You mean like we all learned our first language? I certainly don't remember being sat down with a dictonary and grammar book when I was 6 months old and starting to speak English. I may have used a dictionary in primary school for some difficult words but not that much. I certainly didn't use a grammar book in primary school or secondary school.

The only reason you can be making such a claim is either it is a language you have made up, and have failed to make a manual to allow anybody else to learn it; or you are simply making an untrue exaggeration designed to make yourself appear superior.
Well that isn't completely fair.

1) I literally used 3 commands in R. One to install stylo. One to say that I wanted to use the stylo library. One to run stylo with the default settings. It is hardly rocket science. The rest, creating the corpus directory, copying files, looking at the results was all done outside of R.
2) R and stylo do have documentation
3) There are youtube videos showing you how to do simple stuff in R and stylo
4) Even if I had to write my own functions in R, then the language is not completely out there on its own, it is not some weird thing that shares no similarity with other languages.

217prosfilaes
mayo 10, 2022, 9:53 am

>205 faktorovich: The absence or insufficiency of coherent manuals or "documentation" is one of the ways this un-usability has avoided detection, as nobody can learn something without any guide as to how it works.

There's an R for dummies book, for crying out loud. For Dummies, Manning, O'Reilly, Packt and Chapman & Hall each have multiple books on the subject. You're talking about one of the most solidly documented programming languages. Whining about documentation is simply an excuse.

218Keeline
mayo 10, 2022, 10:20 am

If one wants to become acquainted with Stylo for R, these research papers may have interest. I was able to download the PDFs for free with an account on the site so trust others can too.

https://www.researchgate.net/publication/313387787_Stylometry_with_R_A_Package_f...

https://www.researchgate.net/publication/345358488_Stylistic_palimpsests_Computa...

As I have time I will read them more carefully to see if they are as helpful as they appear to be on the surface. I have loaded them on my iPad for offline reading.

James

219faktorovich
mayo 10, 2022, 11:55 am

>209 prosfilaes: A better example is if you come into a store with your own scale that has been giving you precise data for years. You have weighed your two grapes with your own scale. Then you come to the counter and the attendant weighs them and tells you they are "five pounds, and two grapes have always weighed five pounds." Other attendants come up with their own rigged scales and they all insist that they have "calibrated" their scales and according to all of them the 2 grapes are "five pounds". And they insist that you must "calibrate" your scales, so that on your scale the 2 grapes would also weigh exactly "five pounds, or the 'correct' amount that has been agreed to by all of the attendants".

220Petroglyph
mayo 10, 2022, 12:02 pm

>218 Keeline:

The README file over on Github has detailed installation instructions (including a few caveats for mac users).

There's links there to Howto pdf files that tell you what the functions do and how to use them.

It also links to a three-part set of instructional videos by Maciej Eder, the main developer behind Stylo. It shows you how to use the tool, and what the various features and settings mean.

There's more useful tips there, but I'll end by including a link to the Computational Stylistics Group, whose projects page lists a few small-scale test projects you can do to get comfortable with the software.

221faktorovich
mayo 10, 2022, 12:09 pm

>210 Petroglyph: For the purposes of tagging words, it is absurd if the corpus of the words you are tagging includes any words more than once, as you only have to tag each unique word once for it to be automatically marked with the specified tag when a program is applied to any corpus fed into this system for tagging. So to say you have a corpus of 100 million tagged words, when you actually have tagged a million words or less, and the rest are repetitions of the same words in different texts is absurd. A tagged corpus is basically a dictionary of all the known words with tags identifying their categories; so it would be absurd if a dictionary had annotations next to words that some words appeared 1,000 times in the texts consulted while making the dictionary, while others appeared only 5 times etc. Using the 100 million figure is just designed to make the dictionary seem enormous, even if the actual number of tagged words might have been far less than the 1 million actual English words that exist in some dictionaries that could have been tagged. Tagging 1 million words would take a lot of labor, tagging 100,000 words, while making it sound as if you have tagged 100 million words takes a lot less labor, and sounds like a much bigger achievement.

I specify that I tested 284 texts with 7.8 million words between them because all such studies cite this figure to indicate the size of the corpus, and my study has more such words than all previous computational-linguistic studies from the Renaissance. I do not say that I have tagged 7.8 million words, or that there are 7.8 different words in the corpus. Instead, I have a column in one of the GitHub tables that lists the number of words in each text, and this column adds up to 7.8 million words. While an attribution study does not need to cover more than 284 texts with 7.8 million words to reach a firm attribution even about this highly collaborative Renaissance corpus, a tagging system absolutely should include more than 100 million words from texts if its goal is to create a dictionary of as many unique words as possible, and in fact it should not be looking at texts, but rather focusing on dictionaries that parse out the 1 million or so unique English words. A corpus can have 100 million words in it, but only 70,000 unique words, so using a fragment of texts is a nonsensical approach to such tagging.

Your barrage of personal attacks in your conclusion reaffirms that you have no logical basis for your tirade, and you are just howling at the moon with nonsensical rage.

2222wonderY
mayo 10, 2022, 12:16 pm

>219 faktorovich: Arrogant much?

223faktorovich
mayo 10, 2022, 12:19 pm

>212 Petroglyph: You appear to have found Percy's linguistic quirk. He might also have used this quirk in this particular text to signify a unique way of speaking for Capulet, as two of these fragments are from his lines, and the third is from a Servant. If this was a common problem that would influence the results of a statistical attribution method, you should find far more than 3 words that are influenced by it in any given text with around 30,000 words. So, go ahead and check this text and let us all know if these are the only 3 cases, or if there are more instances how many there are. If this glitch impacts 3 words out of 30,000 that's .01% of the text. If 10% of this text are verbs (3,000) then 3 verbs out of these 3,000 make up only .1% of all verbs, so this would make absolutely no impact in the attribution of this text. And if this glitch had occurred in 10% of all verbs in the text (which is an absurd concept to imagine) then it would an extreme linguistic (poker-like) tell that would skew the data towards a more precise attribution of this text to Percy (if he overused this verb variant); again a grammarian might look at the results and question if the verbs are interpreted correctly, but if they are tagged consistently as the same tag-X versus tag-Y etc., it would be a strong indicator of authorship.

224amanda4242
mayo 10, 2022, 12:25 pm

>219 faktorovich: ???

Do you...not understand what calibration is?

225paradoxosalpha
Editado: mayo 10, 2022, 12:27 pm

>221 faktorovich: you only have to tag each unique word once for it to be automatically marked with the specified tag when a program is applied to any corpus fed into this system for tagging

I'm far from expert at this sort of technology, but this claim looks incorrect on its face. Don't homographs complicate the matter?

>224 amanda4242: She's made quite a few posts to demonstrate that she does not.

226faktorovich
mayo 10, 2022, 12:54 pm

>213 Keeline: You are saying that the 100 million words claim is similar to Oxford dictionary's claims about having "273,000 headwords; 171,476 of them being in current use, 47,156 being obsolete words and around 9,500 derivative words included as subentries." Thus you are suggesting that there are similarly "100 million" headwords, with the other variants within this system. This is entirely different from claiming there are 100 million words in a corpus of texts that had been tagged to create possibly only 100,000 tagged words. By creating this comparative parallel you are thus exaggerating the claim to make the 100 million figure sound even greater and all-encompassing.

To defend my accusations against you guys regarding your advertising of Stylo; you are using "OpenOffice" being free as a defense. This is like if I used the argument that air is free, when questioned why I have blocked water access to the city, and started charging all residents to get water from my pump.

My objection is not that programming is not easy, but that the tools you recommend are unusable even for an advanced programmer because the programs have glitches in them. And in cases where a program can generate tagging without major glitches (at first glance), it requires a programmer to create a program to put the data into a usable format. My computational-linguistic method is designed so that I do not have to help users who want to follow my method at all. They can use free tools and the steps I have explained for free in this discussion. If you method was just as good, it should be at least as easy to use your method. And instead, as you keep repeating, all who use your method must be computer programmers. The number of people in this field has been shrinking: https://datausa.io/profile/soc/computer-programmers#:~:text=The%20number%20of%20....

I have demonstrated my method works for texts, but there is no such thing as "texts of 100% known authorship" because any text with a firm byline could have been ghostwritten, especially if no previous study has compared it with a significant portion of texts with other bylines in a given genre. Any method that relies on any line as being "100%" known simply because it is placed on a text is biased in its core, and cannot produce accurate results.

It is a plagiarism to cite a source without giving credit for where it is from. Somebody can search for words in this fragment to learn the intended method, so you are not hiding anything by leaving out the name; you are still advertising this method.

And I have tested my own texts, and they do show that they match each other linguistically. I have tested texts from several different centuries, like the dozens of texts I tested from the 18th century, and several from the 19-20ths centuries. I have established that it works. My results for the "Lunch" experiment mostly matched the results Petroglyph had, with the exception of the ghostwriting intersections, co-authorships my method spotted that Petroglyph's analysis ignored (because his data was far less precise, and because he just ignored the same abnormalities when they did come up). Since most of my results matched the established bylines for this "Lunch" experiment, this should have been sufficient to stop all of you from repeating the criticism that I haven't checked it against known "bylines". But your point is that all computational-linguists must manipulate their data to make it look as if nearly all known bylines are correct; this is unethical, and yet you keep repeating it with progressive loudness.

"Renaissance" is the correct term because Harvey's earliest published books under his own byline discussed the revival of Greco-Roman classics into new variants in English, and he is likely to have taught Percy and Jonson ("Shake-speares") these lessons that they applied in their adaptations of Greco-Roman dramatic rules both to Greco-Roman and English subjects. The revival of ancient knowledge via its publication for the first time in printed books was essential for starting a global education process.

Scholars have been questioning (adding and subtracting) authorship claims from the Renaissance until the present moment. A significant portion of all scholarship has been about determining authorship. There is nothing new about my work to "unseat" any "authors" from bylines. Yes, most previous attribution studies have been wrong because they have focused on a byline's education-level, or hidden messages. That's why most of the texts were initially anonymous, but most of these have since been assigned bylines by scholars. I have established precise credits for the Renaissance's texts with a systematic quantitative method. There is nothing "hidden" about a "Nashe's Ghost" byline. The evidence I am finding in the texts and documents is not hidden, but it has been ignored by scholars who have been influenced by their intuition, whereas I am driven by data and facts to arrive at the precise, proven truth.

"Every accusation is a confession" is something a dictator would say as he threatens a reporter who is accusing him of corruption of finding dirt on him instead. As I stated previously, I have absolutely nothing to hide. So every accusation I make is based on firm facts about those who I am accusing; and there is nothing confessional in my statements about myself.

227faktorovich
mayo 10, 2022, 1:00 pm

>216 andyl: I have already proven that Stylo/R has bugs in its system that make it unusable for attribution. Just refer to my previous explanation. Unlike you, I am not going to repeat myself.

228Petroglyph
mayo 10, 2022, 1:04 pm

>224 amanda4242:
>225 paradoxosalpha:

She does not. At all. Like, the whole point of it has passed her by.

>225 paradoxosalpha:

Yes, a record vs to record; do as a main verb vs do as an auxiliary; better as a comparative adjective vs better as a verb. Discriminating between those often requires looking at the POS of the surrounding words: the noun record will often have a determiner (a or the) in front of it; the verb will be surrounded by adverbs, nouns, or the infinitival marker to, to give an example.

Also, obviously, you don't tag a corpus with POS just to get every unique word assigned a tag. Things like that already exist, and we call them "dictionaries". That is part of their raison d'être.

In fact, tagging software often makes use of such a pre-prepared lexicon -- a list of words that have been tagged with the correct tags (the two records and betters etc. will have been included twice there, with the relevant tags). When tagging large portions of text, the software will treat the lexicon as a lookup table.

And before Faktorovich chimes in with her utterly ignorant drivel: this is how analyzemywriting.com does their tagging. They use a particularly unsophisticated one (e.g. all instances of have and do are marked as auxiliary, regardless of actual syntactic context). But that's how they do it.

There are other methods, but they are more complex and I don't feel like explaining them right now.

Anyway. You tag all the words in a corpus with parts of speech so you can look at things like passives (the books were gathered, you got served, the collection that was looked through) -- which, in English, is a particular combination of a few auxiliaries and a participle. You really can't do that in any other way over large quantities of text.

POS tagging allows you to use corpora to look at things like the number of adjectives before nouns; the alternation between attributive (a harsh complaint vs the complaint is harsh), adjective complementation (e.g. accustomed to luxury vs accustomed to leading a life of luxury vs the life they were accustomed to lead -- the distribution between noun, -ing form and bare infinitive); the difference between This is obviously wrong and Obviously, this is wrong (i.e. from verbal adverb to sentential adverb). And thousands of other bits of variation and language change that involve a syntactic, collocational or contextual component. (Hint: that is all of them.)

You can't do that unless you have millions, or even hundreds of thousands of words tagged.

Like, that is the entire point of doing this. Just tagging every type/unique word once is... absolutely ridiculous? Like, who in their right mind would even think that that's how things work, or how they are supposed to work?

Faktorovich does not know what she is talking about, at all. But she'll sure try to sound knowledgeable.

229faktorovich
mayo 10, 2022, 1:05 pm

>218 Keeline: Many books can be written about a subject without the thing being described being as good or usable as the propaganda-books insist it is. I recently watched a documentary about the Marcos family in the Philippines sponsoring a Facebook campaign to puff their achievements to make them sound like saints, after they have been found guilty of stealing billions from the country's citizens, as one of them is trying to become President yet again. I checked the available manuals and queried everybody here who is advertising for Stylo, and there was no possible way of making this tool usable for an actual attribution analysis. Even if all of the bugs were solved in this broken system, and it actually worked; my 27-tests method achieves 27-times more than the 1 word-frequency test that it at most can perform, as I have not seen any studies that use anything beyond the word test in Stylo.

230faktorovich
mayo 10, 2022, 1:08 pm

>220 Petroglyph: You are now advertising hundreds of Stylo users' projects without giving any actual tutorial links to how to perform even a basic word-frequency test in Stylo. If there were such instructions posted somewhere, you would have linked to them by now. The "first steps" such as "installation" does not at all help somebody who needs to get an actual result out of this program. Saying that a lack of glitches in an installation (and there were some glitches even during installation) is sufficient to prove a program works, is an absurdly low bar.

231faktorovich
mayo 10, 2022, 1:11 pm

>225 paradoxosalpha: Yes, homographs is one of many glitches that can occur during tagging that might miss-categorize isolated words in a text run through a tagging system. Some glitches occur in all systems, but this group has been attempting to argue that Analyze's tagging is inferior because they have found glitches in its results, when they have not performed the same identical tests in Stylo or the other programs they have been advertising to see if they would be any better at spotting homographs' intended meaning etc.

232faktorovich
mayo 10, 2022, 1:18 pm

>228 Petroglyph: If you have 2,000 sentences with half that include one of these phrases "accustomed to luxury vs accustomed to leading a life of luxury". The 1,000 example of this phrase is not going to be any more useful at explaining the proper tagging of a word in relation to the words next to it. There are grammatical rules behind word-type-order that would establish the required tagging far better than if a computer is fed examples. Only if a human interferes and reviews these 100 million words to see if they are tagged correctly, and edits the cases of tagging by programming each special case with how it was supposed to be tagged would the presence of this large corpus make sense. Otherwise, the system just needs to know what word-order is most likely to lead to tagging with X vs. tagging with Y.

233Petroglyph
mayo 10, 2022, 1:47 pm

>231 faktorovich:

Analyzemywriting.com uses a simple lookup list for its tagging. Stylo does basic text comparison (and some machine learning techniques as well). It is not a tagger, nor is it meant to be.

Are all pieces of software just interchangeable to you?

this group has been attempting to argue that Analyze's tagging is inferior because they have found glitches in its results

We've been arguing that AMW's handling of homographs is non-existent. Have and do are always categorized as auxiliaries, despite being main verbs sometimes.

Here is what AMW makes of "a record. to record" (so, noun and verb):

It only categorizes record as a noun.

Here are the results for "better results. to better your results" (so, a comparative adjective, and a verb):

It only categorizes better as an adjective.

Conclusion: AMW uses a lookup table that completely ignores homographs. And you have based all your POS tests on this.

Professional taggers don't do this. Look at the list of categories that CLAWS uses in >170 Keeline: "AJ0" = Adjecive; "AJC" = comparative adjective (e.g. better); "AJS" = superlative adjective (e.g. best). That level of detail is unachievable with AMW.

they have not performed the same identical tests in Stylo or the other programs they have been advertising to see if they would be any better at spotting homographs' intended meaning

It is not my responsibility, nor anyone else's, to convince you that better alternatives exist. It is not my responsibility, nor anyone else's, to demonstrate to you that things more sophisticated than AMW yield more reliable results.

As you've amply demonstrated in this thread, you will ignore criticism, smear it (and the people who offer the criticism) with knee-jerk accusations, and remain entrenched in your narrow comfort zone. Even if I went to the trouble of comparing TreeTagger or CLAWS with AMW, you'd simply remain unconvinced that your home-brew tinkering was in any way inferior to professional tools with decades of development behind them.

234Petroglyph
mayo 10, 2022, 1:58 pm

>232 faktorovich:

You don't tag in order to "explain the proper tagging of a word in relation to the words next to it". The tagging is not done just for the purpose of tagging more.

Tagging is done in order to do things with the results of the tagging, after that process has been completed. So you can extract all the hits of a relevant set of grammatical structures and constructions you want to investigate. Looking at "accostomed to":

The top 54 hits are of the type "accustomed to the NOUN PHRASE"; then follow a bunch of "accustomed to INFINITIVE". A properly tagged corpus would be able to give you the proportions of these two structures; it would allow you to search very precisely and directly. Looking at corpora from different time periods, this would allow you to track the changes in distribution between "accustomed to" + noun / infinitive / ing-form. (spoiler: the proportion of bare infinitives is way, way down these days compared to two, three hundred years ago; -ing forms are on the rise.)

Only if a human interferes and reviews these 100 million words to see if they are tagged correctly, and edits the cases of tagging by programming each special case with how it was supposed to be tagged would the presence of this large corpus make sense

You, who until recently knew nothing about POS taggers, think that a problem you literally just thought up has not been addressed? In the sixty years of development behind these algorithms?

235Petroglyph
mayo 10, 2022, 1:59 pm

>230 faktorovich:

Lies, deflection. Retrenchment in your comfort zone.

Your comfort zone is not the measure of research. Nyarlathotep be praised.

236Petroglyph
mayo 10, 2022, 2:04 pm

Hey, Faktorovich.

Instead of using strikethrough across William Shakespeare's name on your covers and title pages:

And instead of using scare quotes all the time ("William Shakespeare") in order to indicate that this attribution is passé.

Have you ever given thought to using Wouldiwas Shookspeared?

Please let me know your opinion ASAP. If you decide to use this name, I want to be cited.

237andyl
Editado: mayo 10, 2022, 2:13 pm

>230 faktorovich:

You sound like one of those whiny students who post to forums asking people to give them answers for their homework.

Getting a word frequency list is really easy. You can even find out how to do it in the non-existent documentation https://rdrr.io/cran/stylo/man/make.frequency.list.html

The above link was the second result when I googled for "R/Stylo word frequency"

238Petroglyph
mayo 10, 2022, 2:19 pm

>237 andyl:

Run any analysis through Stylo, and it will create one automatically. (The frequency lists is what many of the subsequent analyses are based on).

This frequency list will be dumped in the working directory as "table_with_frequencies.txt"

Faktorovich knows this: she asked for these files way back in the Oz portion of this thread. I provided those files to her then.

239andyl
mayo 10, 2022, 2:48 pm

>238 Petroglyph:

Yeah I know. That was relative frequencies, maybe she wanted absolute counts. I don't know if that is configurable in the Tk GUI that pops up - I run R on a machine without any graphics.

But my comment also aimed to point out that there is very good documentation, and how a simple google search finds it in the first responses.

At the moment if she told me it was raining I would have to look outside to check.

240Petroglyph
mayo 10, 2022, 3:37 pm

>238 Petroglyph:

I see. My mistake, carry on!

(yeah: as R goes, stylo is very well documented)

241amanda4242
mayo 10, 2022, 3:46 pm

>239 andyl: At the moment if she told me it was raining I would have to look outside to check.

I've been thinking much the same thing for months, but worded it as "If she told me water is wet I would seek independent verification."

242faktorovich
mayo 10, 2022, 8:51 pm

>233 Petroglyph: I am still waiting for one of you to use the tools you are recommending to create anything like a column that measures the percent of nouns, etc. in a text that would actually move the process through the first step of measuring the percentages of word-types. You have given a tagging service, and have managed to create a program to add these up, but you haven't even attempted to add up all of the nouns categories together in your data; and we are nowhere near a simple one-stop tool that is free and can achieve a count of word-types without the user having to write a program of their own.

And as I said, the method has to have mathematic consistency and not exact grammatical accuracy. Always measuring "record" as a noun, even when it is a verb is not a quantitative problem if the question is not how many nouns are in a text, but rather how many words in a given text are in category X vs. category Y etc. In other words, it would be like counting the frequency of words in a category that makes up 10% of a text (more or less) as opposed to merely counting most frequent, or medium frequent independent words. Grouping them by word-type achieves a uniquely accurate measure of linguistic style. The failure of a tagger to distinguish nouns and verbs that are spelled identically to each other is not relevant as a given term like "record" is always consistently tagged, and thus still identifies the user as somebody who prefers to use this term over other options. If it bothers you greatly, you should be writing to the programmers of this system to ask them to volunteer to help them for free to fix this and other glitches you have noticed. Analyze is the best and most relevant tool I have found, and until you have a rational alternative, you are just criticizing something that works, while advertising tools that you haven't even explained yet how they could work to achieve the same result (or if your alternative might not generate the same glitches: you have not being testing the same words with both systems to check if there is a difference, or if programmers have just not managed to solve these problems yet).

Distinguishing between "comparative adjective (e.g. better); "AJS" = superlative adjective" does not require the same type of complex analysis about word-order, or neighboring words involved in the recognition of differences in meaning between "record" when it is used as a noun or a verb. There is no category for "homographs" in the list CLAWS offers; so it does not solve the central problem you have described in this post. So while it seems as if there are more adjective types; an ideal system is not one that distinguishes the highest number of word-types, but just one that captures the larger divergences between texts and presents data in a format that is applicable for all non-specialist researchers.

243faktorovich
mayo 10, 2022, 8:57 pm

>234 Petroglyph: While there are some linguistic studies where finding all occurrences of a phrase + infinitive might be relevant; it is entirely irrelevant for a computational-linguistic method that is interested in the relative number of broad word-types in the text to establish the characteristics of a linguistic signature. It would also be just as efficient to search for all instances of "accustomed to the..." in the text and check what follows, versus creating a program that categorizes these and then still reading through a long list mostly with single-appearances of each uses per-text.

244prosfilaes
mayo 10, 2022, 9:04 pm

>226 faktorovich: I have demonstrated my method works for texts, but there is no such thing as "texts of 100% known authorship" because any text with a firm byline could have been ghostwritten,

On one hand, all data has errors. You try to minimize them and hope they don't affect the project much. On the other hand, you're saying that most works have ghostwritten, in which case, we're at square one. Show me some evidence that any of this is in any way reliable for detecting authorship. There's no evidence that people have a consistent writing style distinguishable from others, especially those trying to mimic it. All previous tests were done on texts that were presumed to be of 100% known authorship, probably with tools like R/Stylo.

If you method was just as good, it should be at least as easy to use your method.

Which is, of course, why we produce paintings with crayons; if oil paints were better, they would be as easy to use. From my perspective, your method is terrible to use; you have to manually do all this stuff and copy results back and forth, whereas with Stylo I create my corpus, open the program and hit go.

"Renaissance" is the correct term

No. Whether you call it the "Early Modern period" or the "British Renaissance" is not a matter of correctness; it's a matter of labeling and communication.

Scholars have been questioning (adding and subtracting) authorship claims from the Renaissance until the present moment.

Now you claim you're doing nothing remarkable? Had you merely claimed that six ghostwriters had done almost all of the literary work in the time period, that would be a pretty audacious claim. It might not have got me coming back to this thread, because it's conceivable. Six ghostwriters wrote all the writing in the time period, including that by people like judges and preachers, whose job was in part to write, plus one guy wrote both the Protestant and Catholic pamphlets and translated the whole KJV? That's preposterous.

There is nothing "hidden" about a "Nashe's Ghost" byline.

No; there's no evidence that anyone has ever looked at it and assumed it was Nash, either. The fact that someone used a pseudonym once doesn't mean that every name is a pseudonym.

245faktorovich
mayo 10, 2022, 9:04 pm

>236 Petroglyph: The "Shakespeare" attribution is not "passé"; it is inaccurate. I use the quotation marks to indicate derision and the falsity of this attribution-claim, while mentioning the byline that is currently broadly accepted as accurate for a given text.

Meanwhile, you have started pirating front-matter pages with images from BRRAM. To avoid a blatant piracy violation, you at least have to cite my name and the source for the image you have copied; the operators of the museum that makes this image available specifies that a credit that it came from them must be provided with the image, and you have not provided this credit.

246faktorovich
mayo 10, 2022, 9:09 pm

>237 andyl: Word-frequency is indeed easy, as it can be generated with the program I cite in my book that generates a list with frequencies for all words in a text. There is no need to create a new program to count frequencies. We are not discussing general word-frequencies in the last few posts, but rather word-category frequency (nouns vs verbs). And your "documentation" does not provide the "easy" steps for how to create a frequency list, but rather just summarizes that it can be created. Maybe you should have kept searching beyond your second googled result.

247faktorovich
mayo 10, 2022, 9:24 pm

>244 prosfilaes: I am not saying that "all" or "most" of all texts ever written were ghostwritten. I have not checked nearly enough texts to reach such a conclusion. I have checked enough texts to prove that all, or at least all of the texts I checked from the British Renaissance were written by six ghostwriters. This is a statement of fact based not only on the quantitative data, but the other evidence that you are all working to ignore, as if it is not the overwhelming proof that guarantees my method has reached precisely correct attribution results.

"There's no evidence that people have a consistent writing style distinguishable from others, especially those trying to mimic it." If you are so biased that you believe this to be a fact, you should not go anywhere near authorial attribution, as you do not believe such attribution is possible, and to achieve something you have to at least imagine that you can succeed in an endeavor. "All previous tests were done on texts that were presumed to be of 100% known authorship". You are basically saying that all previous experiments have been designed to repeat the established byline, without any hope that they can actually distinguish between linguistic styles; this is exactly why all of these previous methods have failed to arrive at anything approaching accurate results.

Then, you are saying that if I had made a smaller claim by crediting fewer works to these ghostwriters, it would have been "conceivable", so you would have ignored my findings and not commented on them at all. But only because my claim is so humongous that you cannot "conceive" it can be true, that's why you are engaging in this conversation, or because you have assumed my claims must be false, and you have assumed you will easily win the argument? My conclusions are not "preposterous", or contrary to reason, as they are instead purely rational; they might only appear unbelievably irrational to anybody who believes to dogma of the established bylines. Similarly, the idea that humans evolved from apes appeared "preposterous" to religious folks when Darwin pitched this idea, and it still appears "preposterous" to some Darwin-doubters today; but science comes up with a true answer, and not one that sounds believable based on what people already believe.

248prosfilaes
mayo 10, 2022, 11:54 pm

>247 faktorovich: "There's no evidence that people have a consistent writing style distinguishable from others, especially those trying to mimic it." If you are so biased that you believe this to be a fact,

You don't understand science. The evidence that people have consistent writing styles is based off certain tests. If you don't believe those tests were accurate, then you're assuming something in your tests that you don't know to be true. Same thing with history; you assume it's wrong in so many ways, but you assume it's right when you want it to be.

My conclusions are not "preposterous", or contrary to reason, as they are instead purely rational;

Purely rational things are usually wrong in the real world. The truth of the real world does not come from arguing like Greek philosophers; it comes from poking at things, and trying to not let your preconceptions get in the way of what they're telling you.

they might only appear unbelievably irrational to anybody who believes to dogma of the established bylines.

I see you didn't read what I wrote. I don't know who translated the KJV; I do find the idea that some exiled author did the whole thing a lot more preposterous than a bunch of scholars. I don't know any of the religious authors bickering back and forth, and given religion and the law at the time, I wouldn't be surprised if many of them were written under pseudonyms. But the idea that there was one person writing everything for all sides... why? It's not even like most of that stuff sells well. Much of it is the 17th century equivalent of forums like these; people arguing back and forth over stuff that most people consider trivial, to be heard and to oppose the idiots on the other side.

The issues is not about any dogma of the bylines; it's that people don't behave that way.

Darwin

Actually, Huxley wrote all of Darwin's works; Darwin wrote under the name Adam Sedgewick, where he called the Origin of Species "A cold atheistical materialism".

science comes up with a true answer, and not one that sounds believable based on what people already believe.

"But the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright Brothers. But they also laughed at Bozo the Clown." -- Carl Sagan

249amanda4242
mayo 11, 2022, 12:19 am

And now for something completely different, a purely rational conclusion.

250Petroglyph
mayo 11, 2022, 1:35 am

>249 amanda4242:

Or the first few minutes of this sketch.

251Petroglyph
mayo 11, 2022, 1:53 am

>245 faktorovich:

you have started pirating front-matter pages with images from BRRAM. To avoid a blatant piracy violation, you at least have to cite my name and the source for the image you have copied

You mean this post?

The post where the image contains the title (highlighted in a yellow box), that starts by explicitly addressing the author (you, highligted in blue), that explicitly assigns the book to you ("your covers and title pages", highlighted in green), and that points out a noteworthy feature (highlighted in orange) that uniquely identifies Faktorovich's BRRM. That post?

How long do I have to keep that post up before your writing starts to suffer from the emotional distress (as in >114 faktorovich:)? How much money do you think your emotional distress is worth in a court of law?

And, more importantly, what do you think of "Wouldiwas Shookspeared"? You never answered that one.

Do cite me if you decide to use it!

252Petroglyph
mayo 11, 2022, 1:59 am

>243 faktorovich:

To paraphrase: "why use precise tool when blunt tool do trick poorly?"

It would also be just as efficient to search for all instances of "accustomed to the..." in the text and check what follows, versus creating a program that categorizes these and then still reading through a long list mostly with single-appearances of each uses per-text

Your manual, one-text-at-a-time mindset is showing again.

Take a look at this:

This is an overview of the collocates of the verb kiss (not the noun kiss!), specifically all the collocates to the right. They have been sorted by their POS (noun, adj, verb, adv).

"reading through a long list mostly with single-appearances of each uses per-text" my ass.

You've never really performed any corpus analysis, have you?

253Petroglyph
mayo 11, 2022, 2:30 am

>242 faktorovich:

I am still waiting for one of you to use the tools you are recommending to create anything like a column that measures the percent of nouns, etc. in a text that would actually move the process through the first step of measuring the percentages of word-types.

Every time you ask us to do your homework for you (thanks for that comparison, >237 andyl:), the list of demands grows longer. Now you are "waiting" for us to visualise the results in a bar chart, as well? And it should resemble AMW, apparently, too.

You've been pointed to the tools. Their workings have been explained to you. You've been given the correct search terms. Good luck!

Distinguishing between "comparative adjective (e.g. better); "AJS" = superlative adjective" does not require the same type of complex analysis about word-order, or neighboring words involved in the recognition of differences in meaning between "record" when it is used as a noun or a verb.

Distinguishing between better (comp. adj) and better (verb) does require looking at adjectival behaviour vs verbal behaviour. There's 62 different categories in CLAWS5. They're not just there to make finegrained-but-useless distinctions. They're there to permit future fine-grained linguistic analysis. That's why comparative and superlative adjectives are separated out.

while advertising tools that you haven't even explained yet how they could work

Do your own homework. Do you expect us to do the work for you? Aren't you the person who sees plagiarism everywhere?

There is no category for "homographs" in the list CLAWS offers

I lol'd at the child-like misunderstanding of that. Ah Faktorovich. You are guaranteed to misunderstand the world in a very unique way.

Always measuring "record" as a noun, even when it is a verb is not a quantitative problem if the question is not how many nouns are in a text, but rather how many words in a given text are in category X vs. category Y etc.

Right. Because when measuring the proportions of nouns, verbs and auxiliaries and so on, it doesn't matter if all homographs are systematically only assigned one category. If that is your stated opinion.

You do know that there are literally thousands of these in English, right? Thousands of zero-derivation nouns/verbs: list, support, kiss, record, order, release, find, use, bridge, work, holiday, move, bank, man, shut, answer, knife, skin, fish, wound, sound, paper, freeze, mask, make, ...? All of these can be a verb as well as a noun (to paper (a wall) vs a paper; man the ship vs a man on the ship). And then there's verb/adjective pairs: open, close, dirty, clean, warm, ....

How is this not a massive issue? How is "I tried command-line R once and I couldn't get it to work now I'm gonna whine online how it's all the bugs that make the thing unusable and its just more corrupt academics shilling their version of history" such a massive issue for you, but the systematic shortcomings of your own methods aren't?

You're never gonna admit you're wrong. Like a toddler who denies drawing on the wall.

254Petroglyph
mayo 11, 2022, 2:34 am

>246 faktorovich:

it can be generated with the program I cite in my book that generates a list with frequencies for all words in a text. There is no need to create a new program to count frequencies.

Ah yes. WordSmith. A corpus-linguistic tool that does Key Word In Context, collocations, concorcances and a few other things. Oh, and word frequency lists. Which is all that Faktorovich uses it for, and so there's "no need to create a new program to count frequencies" As though the point of the programme is to count frequencies and then do nothing with them.

Such a narrow slit to see the world through.

255amanda4242
mayo 11, 2022, 10:09 am

>250 Petroglyph: Ha! I haven't seen that one before. Thanks!

256susanbooks
Editado: mayo 11, 2022, 10:44 am

Requoting >219 faktorovich: in full since it so perfectly encapsulates every single one of the author's posts:

">209 prosfilaes: prosfilaes: A better example is if you come into a store with your own scale that has been giving you precise data for years. You have weighed your two grapes with your own scale. Then you come to the counter and the attendant weighs them and tells you they are "five pounds, and two grapes have always weighed five pounds." Other attendants come up with their own rigged scales and they all insist that they have "calibrated" their scales and according to all of them the 2 grapes are "five pounds". And they insist that you must "calibrate" your scales, so that on your scale the 2 grapes would also weigh exactly "five pounds, or the 'correct' amount that has been agreed to by all of the attendants"."

Could it never occur to you, in this hypothetical situation, that you've been weighing grapefruits all this time, and calling them grapes, faktorovich?

257faktorovich
mayo 11, 2022, 11:29 am

>248 prosfilaes: What you term as "arguing like Greek philosophers" is just how a rational argument is supposed to sound by the rules of rhetoric and logic, as described by writers like Harvey in his rhetoric textbooks.

Again, KJB was primarily ghostwritten by Verstegan, but he had help from Harvey as a co-writer, so he was not its only translator.

You are asking: Why would Verstegan have written propaganda for both sides under a monopoly on theological book publishing? Imagine if you had a monopoly today on publishing all textbooks, theology books, non-fiction books, the Bible (all versions). Even if you only sold a single copy of most of these books, you would sell so many copies of hit titles like the Bible, you would make enough profits to publish numerous other books. And you would receive ghostwriting payments from hundreds or even thousands of "authors" who needed byline credits to advance in their careers. And if a book was accusing somebody of scandalous things that book would sell more copies, and would prompt the opponent being slandered to need to hire you as a ghostwriter to publish a rebuttal. Yes, people behave this way; there have been plenty of ghostwriters who have asked for credit, like Trump's ghostwriter. You don't think there are people in the media today that threaten publishing revealing stories if somebody doesn't pay them to keep them quiet? Just the "National Enquirer" has been accused of extortion to cover-up scandals for Bezos, Trump and others: https://www.theguardian.com/technology/2019/feb/07/jeff-bezos-national-enquirer-... This is what humanity has been since at least the Renaissance; so solving current problems begins with acknowledging the existence of these problems in the Renaissance.

It is a funny history that six ghostwriters wrote the Renaissance; scandal is funny; discovering scandal where it seemed unlikely is funny; laugh if you are cheerful.

258faktorovich
mayo 11, 2022, 11:37 am

>251 Petroglyph: You might have semi-cited my name, but you have forgot to cite the image that you have left as the only item un-circled in your diagram. You see the image belongs to a museum that allows for its reprinting, only if the printer gives credit to the museum for holding the artwork and creating the digitization. So, the question is not if I am emotionally distressed in this matter, but rather if the museum might go after you for copyrights infringement; I have informed you that you have failed to give them credit, so they would be able to argue that you were fully aware of their request for a credit, and intentionally repeated the piracy after being informed. I don't even have to spend any time on this lawsuit as the museum can carry it out with their own lawyers.

259faktorovich
mayo 11, 2022, 11:45 am

>252 Petroglyph: You had to have searched an enormous corpus for their to be 1813 occurrences of the words "kiss" and "cheek" together. In a single 30,000 word text, there are going to be few instances of a word like "kiss" and so you would have to look through these types of pattern lists for nearly every word in the text. In a single text, there would not be 1813 occurrences that could be jointly categorized as noun vs. verb by checking only a single box in a tagging system that does not have an automated system for distinguishing between a noun and a verb. The 27-tests that I use are accurate for attribution, and the fact that I have performed them is registered in their data on GitHub.

260faktorovich
mayo 11, 2022, 11:53 am

>253 Petroglyph: You are all volunteering your thoughts on my research. The unsolicited homework you have been submitting is atrocious. I have just been investing my time into grading it for free and pointing out areas where you can benefit from improvement.

As I stated before, if all homographs are consistently categorized by a system under the word-category it believes each of them falls into; the overall analysis will still be counting the tendency to use or not to use them by the author in the data. An attribution method is designed to spot variation in style, so it does not at all need to be an accurate grammar-checker. The method I designed works with free accessible tools that anybody can replicate. I will not ask those who use my method to deal with the types of unpassable bugs that I encountered with Stylo. If there was a better counting program out there than Analyze, I would switch to using it; but my Renaissance computational-linguistic study is finished (unless I decide to add texts to it to solve mysteries at a later time).

261faktorovich
mayo 11, 2022, 11:58 am

>256 susanbooks: I see, let me adjust the scenario. You come into the store and try to purchase 2 grapes. You go through the same spiel, but after the calibration objection doesn't work to convince you to let the attendants "calibrate" your scale. The attendant insists, "Did you say that's a 'grape'? NO!!! That is a 'grapefruit'! You don't even know what a 'grapefruit' is! Ha, ha!" "But," you object, "they are the size of my thumb's nail, and they are purple, and you are insisting they are 'grapefruits'?"

262susanbooks
mayo 11, 2022, 12:36 pm

>261 faktorovich: And yet everyone else in the world says they weigh 2.5 pounds each. And they, of course, are wrong. Not you. Never you.

263anglemark
mayo 11, 2022, 12:50 pm

>262 susanbooks: Remember, they burned Columbus on a pyre before conceding he was right all the way!

-Johan

264paradoxosalpha
mayo 11, 2022, 1:21 pm

>263 anglemark:
Yes, Columbus was correct from the start: The American continents and the Pacific Ocean are hoaxes. And the non-Christians he enslaved had no right to the Indies he discovered.

265Keeline
mayo 11, 2022, 5:54 pm

>226 faktorovich:

>213 Keeline: Keeline: You are saying that the 100 million words claim is similar to Oxford dictionary's claims about having "273,000 headwords; 171,476 of them being in current use, 47,156 being obsolete words and around 9,500 derivative words included as subentries." Thus you are suggesting that there are similarly "100 million" headwords, with the other variants within this system. This is entirely different from claiming there are 100 million words in a corpus of texts that had been tagged to create possibly only 100,000 tagged words. By creating this comparative parallel you are thus exaggerating the claim to make the 100 million figure sound even greater and all-encompassing.

I have seen an estimate of 1 million words in English from a group in Texas.

https://www.macmillandictionaryblog.com/one-million-words-of-english

The number seems a little high since the largest dictionaries contain a little more than 1/4 this figure. Where are the other 722,000 words?

The numbers I cited are the numbers of words in the Oxford English Dictionary as cited on the Wikipedia page whose link I provided.

I make no such claim that 100 million words has any relationship to 273,000 distinct "headwords".

Since there is obviously some confusion, a corpus is a collection of writings. For the purposes of stylometrics, they are works that have been carefully digitized and edited to be true to the originals.

Let's consider a real example I have access to. If I have a collection of 110 "Tom Swift" books in digital form, that is a corpus. The number of words per volume will vary widely since some books are much longer than others. Typical Tom Swift Sr. books from 1910 to 1935 range between 40,655 and 46,989.

Altogether, this corpus has 1,599,037 words. That is merely the size of the corpus. The number of distinct words is much, much smaller. Making a rough count, the number of distinct words is about 23,618. This may be a bit high because I think some punctuation is getting in the way.

You have claimed that your collection of texts (a corpus) has X million words in it. Yet, I think you realize that the number of distinct words is much smaller. After all, you also claim the high estimate of 1 million words in English.
_____

When it comes to identifying the parts of speech for a word, it is a perilous assumption to think that all word should be counted as the same type. I and others have given examples of words which are spelled and sound identical but have radically different parts of speech identifications.

Identifying the parts of speech is difficult to do. Often it depends on where it is placed in a sentence whether a word like "lead" is a verb or a noun. If you tie your hands by working from a alphabetically sorted list of distinct words in the corpus, without the context, and assume the first parts of speech from a dictionary (or another arbitrary source like AnalyzeMyWriting.com), you will get an unacceptable contribution of noise to the counts of each type of parts of speech (nouns, verbs, adverbs, adjectives, etc.).

If part of your "fingerprint" for an author is based on usage of different parts of speech, suddenly the counts become important that they be as accurate as possible.
_____

In discussing the CLAWS5 tagset, I have already noted how the fine-grained granularity of parts of speech can be combined to count "all verbs", "all nouns", etc. You take the list of tags of interest and count all of them from the CLAWS5 output. Remember when I wrote something like:

egrep "NN0|NN1|NN2|NP0" input_file.txt | ...

The egrep command is like grep that searches one or more text files for the occurrence of a word. With this usage of egrep I have used a simple Regular Expression to get each of the different variants of tags for nouns — NN0 NN1 NN2 NP0 — and asked the system to show me the lines that have any one of those values. If I counted them, they would be an aggregate count of the nouns in input_file.txt. This is not particularly clever. It is merely using the tools available for text manipulation in a Unix-type system (including Mac OS X). If you are using Windows, you might need to look at CygWin to get similar functionality.
_____

As stated by others before. I am not your student, your minion, nor your sycophant. I will try to help with suggestions of tools that are out there.

You are all volunteering your thoughts on my research. The unsolicited homework you have been submitting is atrocious. I have just been investing my time into grading it for free and pointing out areas where you can benefit from improvement.

This forum and its replies are not homework for you to grade and declare to be "atrocious".

The reasons other tools are mentioned is not because of the results they produce which reinforce traditional bylines but rather because the code for something like R and the Stylo extension to it are open-source and have been evaluated by people who like to look for flaws in code as well as people who wish to use them and make them better. Not everyone who uses open-source software looks at the code or tries to change it. But the opportunity is there for someone who knows or will learn to read it or hire someone who can review it. This is part of the reason why open-source software tends to close security holes faster than proprietary software does. There are more people looking at it, looking for problems, and trying to fix them. With proprietary programs and operating systems, you can write in to the company like Apple, Microsoft, Adobe, etc. but good luck getting it to someone who will evaluate the report and initiate a fix if called for. There are problems with some of these programs that stick around for years without resolution even though the companies make large sums in the licenses to use the software.

For example, multi-core processors have been the norm for computers since 2000 or so (1/5 of a century or more). Yet, Adobe Acrobat "Pro" (quotes because I dispute that the expensive program deserves "pro" status; it is more like an amateur program) does not use more than one core or thread for its heavy tasks. If you want to do OCR on a PDF with scanned or photographed page images, only one core is used. On my machine with 6 cores, that means that 5/6 of them are idle and the whole process is slower than it would be if it ran things in parallel. There are open-source utilities to OCR PDFs that do make full use of the available resources. One is a command-line Python script called OCRmyPDF which uses the Tessaract OCR engine. When this is running, you know it because the fans spin up and the load meters show that the computer is working hard. Good programs like this will let you adjust how much of the processor to use.

On my MacBook Pro, which I generally like, there are long-standing quirks such as the Finder's inability (in the past few years) to find file names with certain words in them. Apple changes how things work and sometimes their own engineers don't know the details. It appears that if the full filesystem paths are too long, the storage in the database for the file names is clipped off thanks to a too-short field size in the database. I have other tools (EasyFind) that I can use to get the results but this sort of thing used to work in the Finder. I've gone the rounds with the Apple technical support with lots of demonstrations of what is happening. Eventually they acknowledge there is a problem but won't fix it.

It doesn't matter how good one's word processor is or how much you paid for it. It won't write your novel for you. That is up to the user of the software to learn how to use it and type something worth reading. But I do think that if your hypothesis depends on word counts of parts of speech then you'd better know exactly how they are determined and how accurate they are. Otherwise the whole thing falls apart. Or, in the words of the AnalyzeMyWriting.com site:

Secondly, these indices assume that a text is written in grammatically correct, properly punctuated English. If this requirement is not met and you feed a nonsense text into the program, the famous computing adage applies: "garbage in, garbage out."

https://www.analyzemywriting.com/readability_indices.html

James

266Keeline
mayo 11, 2022, 7:09 pm

>226 faktorovich:

I have demonstrated my method works for texts, but there is no such thing as "texts of 100% known authorship" because any text with a firm byline could have been ghostwritten, especially if no previous study has compared it with a significant portion of texts with other bylines in a given genre. Any method that relies on any line as being "100%" known simply because it is placed on a text is biased in its core, and cannot produce accurate results.

You may have satisfied yourself that you have found the TRUTH but you haven't persuaded the people who are responding here.

My quote from Analysing for Authorship (University of Wales Press, 1996), p. 17, stresses the importance of starting with text you should be able to trust — your own writing. Unless you are quoting someone else, your writing should allow you to determine the traits based on ways of measuring it.

The next level from this would be to take works where the authorship is know with a great deal of confidence and are not in dispute. This can be done, as I originally suggested, by getting writing samples from people now with enough supervision to ensure that they are not copying text from other people to muddy the results.

Your analysis seems to attribute texts to your six authors on the basis that the 287 or so texts are similar (by your measurements) to the writings of these "ghostwriters" in the "workshop" you name.

How can you be sure that your 400+ year-old texts that you attribute to people like Verstigen is really by him? How do you know it was not ghostwritten by another?

You mention "translating" texts, even in your current daily work. Let us see an example of the before and after of a couple hundred words of this process.

In an analysis where vocabulary, sentence length, proportions of parts of speech to the whole, and even punctuation (!!), are relevant to your process, how much is altered in the texts becomes important. Are we detecting traits of Shakespeare or Verstigen or Faktorovich?

Most textual analysis does use a certain amount of pre-processing of texts to make them usable by the software. This could be something like placing each sentence on its own line. Or, if it produces unwanted deviations, proper nouns might be tokenized so a short name, like "Tom" isn't counted among 3-letter words and distorting their proportion in a work. Some procedures call for dropping pronouns if they are more common in one genre than another and you are looking to explore the other parts of the text.

I did read through this article and found that it is a good introduction to using the systems mentioned, Stylo for R. It is not an absolute, hold-you-by-the-hand step-by-step, set of directions but it is the level of detail one sees in academic articles that are introducing / reviewing a process or package. It is just 15 pages plus a cover sheet and nearly all of it is pretty clear. There were some examples I have not seen and will need to learn about but that is normal.

https://www.researchgate.net/publication/313387787_Stylometry_with_R_A_Package_f...

As I stated before, I have a free account with researchgate.net and was able to download this article as a PDF with it. It was a good thing to read while waiting to see if my local court wanted me to serve on a jury. It was more interesting than the home makeover shows that were on the TV. At least the volume was low.

If you were to make a bug report to LT about some function that was not working as you thought the developers intended (not the features you wish they had, that is an RSI), you would be asked to go to the Bug Collectors group (no not insectophiles who will complain about your bug zapper) and describe the problem and detail the steps that would allow the developers to reproduce the undesired behavior. Since you have not done this for Stylo for R, I can't agree that you have

>227 faktorovich: I have already proven that Stylo/R has bugs in its system that make it unusable for attribution.

Perhaps you are thinking of these posts? Or is there another?

https://www.librarything.com/topic/337240#7674605
https://www.librarything.com/topic/337240#7674656
https://www.librarything.com/topic/337240#7676974

I know you think you have but you really quit too soon. It's like sitting in the driver's seat of a rental car and not realizing that it is necessary to depress the brake pedal before you will be allowed to turn the key for the ignition.

Not all text files are the same. One big difference between Windows (which I believe you use from your reference to Notepad in >268 faktorovich: https://www.librarything.com/topic/337240#7674808) and Unix-based systems is how each line of a file is ended. In Unix systems, the character used is a newline (\n or ASCII-10). On Windows it is two characters — the carriage return followed by the newline (\r\n or ASCII-13, ASCII-10). This alone can cause problems with some programs.

Windows (and Word in particular) is nasty about using its own standard for how to encode typographer's quotation marks, apostrophes, em-dashes, and other characters. A document manipulated in Word is likely to have these Windows-specific encodings that don't play well with standards-based systems.

It is not enough to name any old folder "corpus", it has to be in the place where the program is expecting to find it.

The program also has to have permission to read the directory. This is probably not an issue with Windows since most things are wide open. On a Unix-based system it is more common.

Finally, you refer to breaking words. A default behavior of Stylo for R is to take words like don't into two parts do and n't. If this is what you saw and are complaining about, that is configurable as noted in the ResearchGate article that I linked to above.

I've not done extensive work with Stylo for R but I like what I see about it. I am also giving favorable consideration to try JGAAP. I have used other programs previous to this forum thread. I have even written a program to implement the procedure for the method described in Farringdon's book. I made this about 20 years ago for a PCA presentation in 2003 which would generate their particular forms of graphs that Excel would not be able to replicate easily.

I have no particular allegiance to it or a stylometric process. My usual process is to set up some experiments and see if it create expected results for works with known authorship to a high degree of confidence (e.g. a long vintage letter written by and typed by an individual). Once I have that understood, then I can see what the technique offers for more questionable texts. Does it support the extrinsic evidence (letters, contracts, etc.)? Does it make sense with what I know of the circumstances?

But at the end of the day, the stylometric analysis is just an interesting sidelight. It is not the primary evidence. It can support or contradict what the other evidence indicates. If it contradicts, further research is required to understand what is going on.

For example, I know that the series book texts had many hands influencing the final product. This is really true of any text but it is more documentable in the series books. Knowing who contributed which portions can be interesting if they can be detected. It shows the process of producing books like this. That is why I look at this. But I can say with 80% confidence that my next PCA presentation will not be based on a stylometric study. There are too many other interesting things to explore in this field.

James

267prosfilaes
mayo 11, 2022, 7:40 pm

>265 Keeline: The number seems a little high since the largest dictionaries contain a little more than 1/4 this figure. Where are the other 722,000 words?

Beilstein has millions of chemicals mentioned in it; whether the verbal formula really counts as a word is questionable. Many of them are medical compounds; are Zoloft(tm) or sertraline (notm) English words? What about names? Looking at the IMDB data files, there's more than 100,000 first names there, e.g.
...
Abigael
Abigaël
Abigaelle
Abigaëlle
Abigail
Abigaíl
Abigaile
Abigail-James
Abigal
Abigale
Abigayl
Abigayle
...

Names of cities? Mountains? Wikispecies has 800,000 content pages, each with its own scientific label for some group of creatures. What's a word, and what's an English word can get quite fuzzy.

I've edited quite a bit at the English Wiktionary, and it has 588,632 pages with different names with English entries with a definition, with 1,053,233 pages with English entries counting all the grammatical entries (plural, past tense, gerund). It's a wiki, yes, but we do a fairly good job of cleaning things; we require three cites over a year with electronic sources limited mostly to Usenet. It's not the Urban Dictionary.

268faktorovich
mayo 11, 2022, 9:21 pm

>265 Keeline: "You have claimed that your collection of texts (a corpus) has X million words in it. Yet, I think you realize that the number of distinct words is much smaller. After all, you also claim the high estimate of 1 million words in English." Yes, you have figured it out, the number of total words in the corpus' texts is not the same thing as the "number of distinct words" in it.

Yes, you guys have explained that "Identifying the parts of speech is difficult to do." However, what you have failed to do is to show how any program you are pitching as an alternative to Analyze can accurately tags all words, even when they are Homonyms or look the same but have different meanings. As an experiment, I ran the old-spelling version of Percy's "Shakespeare"-bylined "Romeo and Juliet" through CLAWS. Here are the first 2 errors I found going down the list. "mou'd" was miss-interpreted as 2 words: ""NN1">mou" and ""VHD">'d". So, "mou" was categorized as NN1 or "singular noun", when it at least should have been categorized as a "verb", if the system was sensitive to the common spelling of "move" with a u in Early Modern English. And 'd was misinterpreted as "past tense form of the verb "HAVE", i.e. HAD, 'D"; this is not as far as judging "mou" to be a noun, but it still does not precisely categorize the word. And by splitting this single word into two separate words (just because it has a similar appearance with the apostrophe there as a "he'd" or "he would" type of a contraction); thus the word-count for the text and the counting for this one word into two is miscounted.

And the system is not sensitive to the idea that "Samp" can be the name of a character (as it is on a line on its own, and repeats across this play); instead it categorized it as a VVB or "base form of lexical verb".

It would be extremely difficult for any tagging system to tag any text perfectly or even with a high degree of accuracy when Early Modern English is fed into the system. A programmer has to work with a linguist to establish far more complex algorisms for spotting most of the possible glitches by running actual texts through a system and correcting all glitches by writing new lines of code that accounts for all miss-classifications. CLAWS might be a bit better at catching some word-types than Analyze, but the data it generates is not in a practically usable format without a programmer's intervention. Thus, Analyze's somewhat weaker tagging accuracy is substituted with its far more practical applicability to the attribution method I designed with programs like Analyze in mind.

Putting all this aside, are you saying that: if I edit my 27-tests method into an 82-tests method and re-calculate the entire corpus of 284 texts by running them through these 62 CLAWS tests for word-type and my other 20 tests with the standard programs, and this data still points to the six ghostwriters behind the Renaissance; then, you will agree that my attributions are correct, that my method works, and that there are no major linguistic glitches that can be preventing it from working? Or still easier, are you suggesting I just run all of the texts through CLAWS use your suggesting combination of adding all of the nouns (and other same word-types I used) together into the single measure, and re-calculate the data to see if this would generate the same attribution results; then you would agree with me? Or are you insisting that to make you believe my method works I have to create a program that turns my method into an automatic process where you can put your own texts into the system and receive an automatic result? None of these steps are rational, as you are only pitching adjusting 7 out of the 27 tests, and these 7 tests already show the same attribution outputs by themselves, as if the other 20 tests are used separately. If there was something intrinsically wrong with the Analyze tagging system that was creating miss-attributions, the 7 word-type tests on their own would not show the same attribution results as the rest, and they do; so this cannot be a significant problem with my method. If CLAWS' tagging was flawless it might be a worthwhile experiment, but it still generates a significant number of glitches, so it is not a rational test for grammatical purity. Of course the actual reason you would not be willing to agree to any of these options is because none of them involve me paying you to hire you to do the work you keep insisting I have to do. This is exactly how you all are pressuring me to buy your programming services when you are pitching completely unusable tools like Stylo or tools that require programming to make them fit my needs like CLAWS.

269faktorovich
mayo 11, 2022, 9:48 pm

>266 Keeline: "How can you be sure that your 400+ year-old texts that you attribute to people like Verstigen is really by him? How do you know it was not ghostwritten by another?" I explain in several chapters in Volumes 1-2 and in the "Restitution" translation why Verstegan was the only ghostwriter who was alive long enough and otherwise (by his biography) was the only one out of the bylines in his authorial-group of texts who could have written all of them. If you have this question, the logical thing would be to ask me for a free review copy of Volumes 1-14 of BRRAM, and then you can read the relevant chapters and reach the full answer, which is too complex to give here, since you would otherwise assume that my summary was the only evidence I have for the conclusion.

I have already quoted from my translations of "Restitution" and other texts. You can see the "before" or original versions by searching for them on Google Books or EEBO. Here is "Restitution": https://www.google.com/books/edition/A_Restitution_of_Decayed_Intelligence_In/vC... You can search for words in my translation fragments to find the relevant sections.

"how much is altered in the texts becomes important. Are we detecting traits of Shakespeare or Verstigen or Faktorovich?" I have no idea what you are trying to ask. The edits I make to texts prior to testing include deleting unreadable characters, or random dots that have no linguistic meaning. I do not edit the texts in any way that would alter their signature. And "Shakespeare" is a pseudonym, not a real person. I do not "drop pronouns" or take any of the other invasive steps you mention.

The article you are citing mostly repeats the standard mumbo-jumbo in such articles that I commented on across BRRAM. It does include a few lines of code that almost never appear in other scholarly articles on this subject, but these lines give an isolated direction when the problems with Stylo are spread across this program starting with glitches of accessing texts for testing. It would be absolutely impossible for anybody who has not used Stylo before, or even an advanced programmer who is not familiar with Stylo in particular to be able to use it for all of these different functions just based on the information in this article.

Why would I invest any time into researching Stylo's bugs when I have explained that this system is really only capable of handling word-frequency counts, and not a combination of 27 tests that my method requires? I have already completed the tests for the Renaissance, and you just haven't read BRRAM yet, and so you are not aware of what I have found out (as apparent from your question regarding why Verstegan was "the" underlying ghostwriter for one of these groups).

A better example would be if I had built a car of my own design and test driven it to show that it works. And all of you were insisting that I have to change it from electric to gas-operated and for this I had to use only the gas-tank operator that you recommend as the best one in the market, whereas the electric battery I am using is "garbage" because it needs re-charging. So you keep pressuring me to invent a new car and also to fix the bugs that are making your gas-tank suffer occasional sparks. Why don't you just review the car I already invented, and not the old version of the car that you have been using that you think I should duplicate.

When you state "stylometric analysis is... not the primary evidence", I once again realize that you still have not read any part of my BRRAM series aside for this discussion, and you have not even been reading the additional evidence outside of stylometrics that I have been giving inside of this thread.

270prosfilaes
Editado: mayo 11, 2022, 11:38 pm

>269 faktorovich: A better example would be if I had built a car of my own design and test driven it to show that it works. And all of you were insisting that I have to change it from electric to gas-operated

The electric car is about as old as the gas-powered car. Making the first modern electric car that was a serious alternative to the gas-powered car was a huge challenge for many reasons, and required extensive industrial research, not an amateur who decided to poke at it a bit. I don't think that you've shown the level of evidence for a test drive, but even then, sure, your car can drive around the block. How far can it go on one charge? Does it have the speed and acceleration to handle interstates? How long will it last? Will it spontaneously catch fire? How safe is it in an accident?

Reversing the analogy, you present us with a poorly tested system and keep repeating that it works because you know it works and claiming that the system everyone else uses is buggy and that we're in the pocket of big R, instead of looking at the problems with your system.

271faktorovich
mayo 12, 2022, 12:25 pm

>270 prosfilaes: I just watched a documentary about John Harrison (https://en.wikipedia.org/wiki/John_Harrison) "a self-educated English carpenter and clockmaker who invented the marine chronometer, a long-sought-after device for solving the problem of calculating longitude while at sea." He spent 30 years researching the best method for this clock to function at sea until he finally succeeded. Most great inventions and research takes place by independent researchers who tend to self-publish, as Galileo did (if they are not the rare few to win a grant after they have already succeed as Harrison did). Harrison was as much an "amateur" in clockmaking as I am an "amateur" in researching literature, as both of us have spent decades on our craft.

My 27-test computational-linguistic author-attribution is a finished and tested invention. "The level of evidence" is at the max when somebody has published 14 volumes full of evidence, and many more are forthcoming. I have been discussing some of this evidence in this thread, so even if you have not read any of the BRRAM series, you should be aware of the evidence by now, unless you have not been reading anything I have said, and instead have only been listening to yourself. 284 texts is the largest corpus ever tested in one experiment of the British Renaissance (which is the most challenging corpus given the linguistic obstacles with Early Modern English), so this method has been tested on the longest and most complex obstacle course imaginable. In the many periods I have tested, and the hundreds of texts, the method has consistently worked, and has not "caught fire". Whenever I have encountered what looked like a glitch with it, it was actually a unique attribution indication that has led my to a more precise authorial attribution after further documentary research.

My method works and it has led me to correct attribution results. Your method does not work, and that's why users of it keep changing between a dozen bylines for each new attribution study even for a single poem such as "Funeral Elegy". There are no bugs in the attribution steps I designed. Analyze is the least bug infested free tool I could find; Analyze has similar bugs to CLAWS, so there is no better alternative (yet); if there are bugs in Analyze or CLAWS these are not the fault of their users; my invention is not a program to count word-types, but rather the method for how to calculate authorial attribution by using free tools accessible to all. After I finish translating the volumes for BRRAM by around January, 2023, I will perform a new giant study of a range of bylines from the 18th century to reach more precise results and to write a book on the re-attribution of that century in British literature. Before I start that calculation, I will reconsider if there are any new tools that have become available in the past few years, and if I could incorporate the steps to make the process quicker and more precise. I have adjusted my method many times since I began designing it in 2015. This is why my posted data results are extremely detailed and precise. In contrast, I have read decades-old articles about the Stylo method that had identical vague descriptions of word-frequency calculations in the most recent articles.

272paradoxosalpha
mayo 12, 2022, 12:41 pm

>271 faktorovich: My method works and it has led me to correct attribution results.

You make this claim with the conviction of someone who has received in-person confirmation from the shades of these dead alleged authors. I'm beginning to wonder if you are the materialist skeptic you've sometimes presented yourself as.

Stylometric analysis (whether by your disputed idiosyncratic method or more collegially-verified approaches) can only ever be one indicator among many. The supplementary "proofs" you've pointed at in these threads have been underwhelming.

273faktorovich
mayo 12, 2022, 8:49 pm

>272 paradoxosalpha: I have the conviction of somebody who has spent the last year-and-a-half translating these writers texts into Modern English. This process involves re-writing nearly every word, and researching every hinted at meaning and checking every possible quote (without quotation marks) or reference to an earlier source. If you refer back to my explanation about the "Ghosts" of the dead writing, these ghostwriters sold their services with the claim that those who paid to have their byline printed on a book would be immortalized or would live in fame beyond their natural deaths. They did not mean that these "byline"-purchasers would live as spirits or ghosts, but rather that their names and the words attributed to them would still be around hundreds or thousands of years later (like ancient scrolls or hieroglyphic tales about pharaohs in the tombs). By reading and re-writing what these ghostwriters wrote, I have indeed come to know them very well. And what I learned is explained in the annotations and introductions to these translations that take up at least half of each of these volumes. Knowledge is not a ghost that travels into minds. To figure out what my full "proofs" are you have to ask for a review copy and read the details for yourself. My method is rational, correct, and proven; whereas the other method(s) are a black-box of cryptic claims of superiority without any fully disclosed raw data (at least none that does not reveal fraudulence or mistakes). If you are dismissing my method and my "proofs" without reading the books where I present them, you are leaping to shady spiritual conclusions, and not me.

274prosfilaes
mayo 12, 2022, 10:47 pm

>271 faktorovich: I just watched a documentary about John Harrison ... Most great inventions and research takes place by independent researchers who tend to self-publish,

A 17th century clock maker who was raised in the field hardly seems a good comparison to the modern world. I recently read that Marjorie Rice was the greatest modern mathematician who never did graduate work, and she discovered pentagonal tilings of the plane, something that needed a sharp mind and patience more than an intense understanding of complex mathematics. You again make a wild claim that I'm sure you can back up. I doubt it; a huge amount of research is done by with very expensive instruments that independent researchers don't have. David Hahn went about as far as an independent researcher could into nuclear power, and succeeded mostly into turning his back yard into a Superfund cleanup site. Medicines need labs for creation and large organizations for proper testing. My fields of mathematics and computer science are plausible places for independent researchers, but I know of no major independent mathematicians. One can bounce around computer science, but even in this young field, a lot of the stuff was prototyped in universities and made mainstream by big companies hiring university hotshots.

"The level of evidence" is at the max when somebody has published 14 volumes full of evidence

And somehow you don't believe in God/Scientology/that Elvis lives, despite the volumes and volumes of evidence produced for them.

In the many periods I have tested, and the hundreds of texts, the method has consistently worked,

You test if something works by feeding it a problem you know the answer to and seeing if it gives you the results that are known. If you want to be a Galileo, a Kepler, a Newton, an Einstein, show that your system fits the movement of the planets better than your predecessors, not that people who have been watching the skies have been all wrong about where the planets are.

Your method does not work, and that's why users of it keep changing between a dozen bylines for each new attribution study even for a single poem such as "Funeral Elegy".

Which might indicate that stylometric analysis lacks the power to give completely reliable answers, especially for works like the Funeral Elegy.

275Keeline
mayo 13, 2022, 12:01 am

>269 faktorovich:

I have already quoted from my translations of "Restitution" and other texts. You can see the "before" or original versions by searching for them on Google Books or EEBO. Here is "Restitution".... The edits I make to texts prior to testing include deleting unreadable characters, or random dots that have no linguistic meaning. I do not edit the texts in any way that would alter their signature.

You keep using translation but I think most others would call this a transcription.

Words mean things

But if your translation involves normalizing spelling or changing words to synonyms, that is where more information is needed to assess the impact of this process.

James

276Bushwhacked
mayo 13, 2022, 4:13 am

My goodness... the last time I looked at this topic thread was last Christmas... thinking it was dead, I never returned, as I had run out of the ability to laugh and crack jokes about it... but it's clearly like Frankenstein's Monster resurrected! Presumably Frankenstein being ghostwritten by some impoverished writer in a Dickensian London setting... which of course poses the question, was Dickens ghostwritten as well? Talk about down the rabbit hole...

277thorold
Editado: mayo 13, 2022, 4:30 am

>271 faktorovich: Most great inventions and research takes place by independent researchers who tend to self-publish, as Galileo did

Galileo was a professor in a leading university, as close to being a professional scientific researcher as you could be in his day. He was also a leading figure in the scientific world of his day. Even in old age, after his brush with the church authorities, he still counted as a major tourist attraction, as witness Milton seeking him out on his grand tour.
I don’t think the modern distinction between publishers and printers existed in his day, so I’m not sure how you would define “self-publish”.

Almost all inventions are made by professional people — engineers, scientists or artisans — in the course of their daily work, as incremental improvements to previous work by others in the field. Almost all inventions made by private people come to nothing, either because they are not original, or because they lack the means for putting them into practice. In the very rare instances when a private inventor invents something significant, that is news, and we read about it: you never hear about all the hundreds of unfortunate people who waste their money inventing perpetual motion machines, flying cars, spaceships and sometimes potentially useful things that simply never make it to the market. I know this because I worked in a patent office for many years. I also know that surprisingly few of my colleagues ever came up with Nobel Prize winning ideas about relativity in their spare time.

278Petroglyph
mayo 13, 2022, 5:46 am

>259 faktorovich:

I don't understand what your problem is. Large corpora allow meaningful scrutiny of relatively rare patterns? Or of very specific combinations where both parts are explicitly specified (KISS + "cheek" instead of KISS + NOUN)?

In a single text, there would not be 1813 occurrences that could be jointly categorized as noun vs. verb by checking only a single box in a tagging system that does not have an automated system for distinguishing between a noun and a verb.

??? Of course there would not be a single text with that many occurrences of kiss + cheek. Why is that noteworthy? Is this a problem? And of course a corpus that's untagged would not allow for this kind of analysis.

The 27-tests that I use are accurate for attribution, and the fact that I have performed them is registered in their data on GitHub.

You may have performed them. But that's about all that can be said about that.

One of these days we need to have a conversation about "data" vs "analyses". I don't think that distinction is particularly clear to you.

279Petroglyph
mayo 13, 2022, 10:13 am

>260 faktorovich:

The unsolicited homework you have been submitting is atrocious. I have just been investing my time into grading it for free and pointing out areas where you can benefit from improvement

K.

the overall analysis will still be counting the tendency to use or not to use them by the author in the data.

You only look at the total numbers of nouns, verbs, adjectives, prepositions etc. You have a wildly inflated sense of the kind of information that is explicitly included in those figures.

If you want to count the tendency of an author to use particular words, you look at those words in particular! You can't glance at the total number of nouns and figure that that says something meaningful about record as a verb or a noun. Or about any other individual noun, or subset of verbs, or whatever.

Same thing with looking at the six most frequent words and phrases. You think that that means you've actually considered and carefully analyzed individual words and phrases. But you only look at the top six by absolute frequency. All you've taken into account is the top six frequencies; you don't even glance at the frequencies of the other words.

280Petroglyph
mayo 13, 2022, 10:13 am

>258 faktorovich:

K.

281paradoxosalpha
mayo 13, 2022, 10:39 am

>258 faktorovich:, >280 Petroglyph:

Litigious is always a good look for a "scholar."

/s

282faktorovich
mayo 13, 2022, 2:25 pm

>274 prosfilaes: The probability that a discovery can be made by an independent researcher versus a funded dependent-on-a-university researcher is of no interest to me. My method works. And I happen to be an independent researcher. I have researched the frequent cases of the best scientists, such as Galileo, who have self-published and worked independently; but I have not conducted a statistical study to determine how much more likely it is that a ground-breaking finding might come from an independent versus a dependent researcher.

One would assume that people who believe in "God/Scientology/that Elvis lives" have read the "volumes and volumes of evidence produced" on these topics; but in truth, it is very likely that none of the people who believe in these have ever read an entire book that attempts to seriously prove in their existence. Such fields insist on belief without proof, and their books tend to be puffing restatements of shared belief, without any attempts at factual proof. If you imagine my BRRAM series has anything to do with any of these belief-based theories; then, you clearly have not read BRRAM. Every word in BRRAM is a piece of evidence that explains and supports the re-attributions and explains the history of the Renaissance from this new perspective.

No, I did not at all know the answer to "Who wrote the Renaissance" before I started this experiment. And all of the answers regarding the names of the six ghostwriters were surprising to me when the data and documentary evidence indicated them.

The planets or the texts are entirely unchanged by my re-attribution. Most of the Renaissance texts, as I have said, were originally anonymous, and have been miss-attributed by linguists across the previous centuries. So if you think I am moving the locations of the planets, and you are only happy with the original bylines; you should add many plays and subtract some plays from the "Shakespeare" collection volumes, as some did not initially have his byline on them, and some that did are currently not in the collections.

I changed my own attribution assignments a few times as I kept researching all other possible bylines. But then I reached a point where as I continued to do volumes of additional research, my attribution conclusions remained solid and unflinching because I was only finding evidence to further affirm the conclusions, and no evidence of any other alternative bylines that could have done it. The problem with previous attribution studies such as "Funeral Elegy" is that researchers have compared it in isolation with only a handful of other texts by the bylines they predetermined to be likely authors; all of them have used a corpus of no more than 20 texts, and frequently only a handful of them; it is impossible to come up with an attribution conclusion with any certainty until a study is expanded to at least 104 bylines tested, with thousands more reviewed for possible authorship, as I found out because when my corpus was smaller I had not yet added into my study many of the bylines that turned out to be the true underlying authors. Out of the six ghostwriters, my initial set of around 100 texts only included the Jonson and Harvey bylines; so I could not have reached a precise attribution, since I was missing 4 of the ghostwriters, and I found out who they were as I kept expanding the corpus of possible authors. My computational-linguistics author-attribution method can and has given precise attribution results, but only if the researcher behind it exhausts the options and keeps researching the subject long before and after the math is solid.

283faktorovich
mayo 13, 2022, 2:33 pm

>275 Keeline: If you do not know the degree of the translation I have performed in the 12 previously published volumes of BRRAM, you are confessing to never have looked inside the series, nor read any part of this discussion where I quote my translations. Distinguishing between a "translation" and a "transcription" would be easy enough if you had done the minimum degree of research by looking these books up in Amazon's LookInside, even if you can see it fit to ask free review copies. I am doing translations and not transcriptions. The degree of the translations is explained in the series. I disagree with the traditional methods of no-translation or extreme-modernization-translations that have been applied to this corpus by previous editors. My approach is to change all words that are classified as "archaic" or are otherwise out of use to their modern equivalents; I modernize and standardize spelling; and I change the order of words when it is following archaic grammatical rules. I do not change any non-archaic words simply because some words are long, complex or unfamiliar to the average modern reader. I developed my own annotation system that makes it easier for readers to grasp all of the major changes I am making in archaic words. As I said, half of these volumes are also annotations and introductions that explain the meaning, references, and my attribution decisions about the texts in the corpus.

284faktorovich
mayo 13, 2022, 2:35 pm

>276 Bushwhacked: I have not tested Dickens before, but I have written about Dickens in my books, and from this I would intuitively conclude it is more likely that Dickens ghostwrote for others, than that his books were ghostwritten by somebody else. Though it is also possible that Dickens only wrote his own books.

285Keeline
Editado: mayo 13, 2022, 2:41 pm

>282 faktorovich:

I have researched the frequent cases of the best scientists, such as Galileo, who have self-published and worked independently

Which of Galileo's books do you claim were originally self-published?

I am not seeing indication of this from the first edition title pages or commentary about him.

I do find that he had several academic posts. Here are a couple of them mentioned:

Galileo Galilei was born in Pisa in 1564, and became a professor of mathematics at the city’s university in 1589 — aged 25. Initially, he didn’t show much interest in astronomy, devoting himself instead to discrediting Aristotle’s views on the physics of motion.

When his three-year contract in Pisa expired, Galileo moved to become the chair of mathematics at Padua, at the time one of the most important intellectual centres in Europe. It was here in around 1597 that he began to develop an interest in the astronomical model developed by Nicolaus Copernicus, which made the then-extraordinary proposition that the sun was at the centre of the universe with the earth and other planets rotating around it.

https://www.christies.com/features/Six-first-editions-by-Galileo-7519-1.aspx

While I don't consider most auction listings to be wholly accurate, this one is more detailed than most. I was mainly looking for information on the publication history of his books.

James

286faktorovich
mayo 13, 2022, 2:51 pm

>277 thorold: Yes, I just reviewed a book the other day that described how Galileo was patronized by the Medici, who not only purchased copies of his printed books in large numbers, but also actively mailed these copies to those they knew to propagate as his patron for Galileo's findings. In the first centuries of print none of the printers paid authors, and only gave them author-copies, and typically authors paid for their author-copies or paid to have their books printed. The idea that a publisher gives an advance to an author also has been subverted, as an advance means a loan against expected profits; this loan has to be returned by the author to the publisher, if the expected number of sold copies is not reached. I cover the latter point in my book on "The History of British and American Author-Publishers". Since I have taught at universities as well and since I am a publisher, I am not exactly entirely independent either.

There is an incredible quantity of boundaries between an inventor in the modern world and their ability to advertise or sell their invention without the act of trying to sell it prompting a competitor to steal it before a patent or sale is solidified. Crooks that steal are more likely to have stolen before and thus have the funds to create a factory to realize an invention. Corporations have a self-interest to monopolize the market to keep people buying whatever it is they have invented centuries ago; any new invention that threatens to make their product obsolete is a threat to their profits, and so they go out of their way to bankrupt the new invention before or after it reaches access to the market. So cereal, for example, is stuck in 1863 when James Caleb Jackson "invented" it. Zero can be spent on science and development if the same flakes can be cold for 2 centuries, and people keep buying them. So however invention is currently done is not working, as it is stifling any hope for innovation. An inventor who had their first product stolen and bankrupted is unlikely to invent a second. This discussion here on LibraryThing is one of these obstacles that I am facing as an inventor. Those who use a rival, widely-accepted method are working to insult my method and to silence my study. Unlike the average inventor who would have run out of money by now; I am an independent publisher, and I have been at it for the past 4 1/2 years and I will keep going for at least another couple of years. Since I have been researching the obstacles that are constant in publishing, writing and invention, I am not surprised by this response, or the obstacles I am facing.

287faktorovich
mayo 13, 2022, 3:26 pm

>279 Petroglyph: According to studies such as, https://aclanthology.org/www.mt-archive.info/MT-1967-Earl-1.pdf, there are around 2,600 "Words forming a homonym in MW3" or Webster's Dictionary, which in total includes "470,000 entries". So, if these homonyms were spread equally among words, they would compose .6% of all words in a text. It is very likely that tagging systems use the most common word-category (noun vs. verb) for each of these homonyms, so in most cases they are likely to pick the right group even if they do not have a special feature to check proximate words for the most likely intended meaning. Even if the tagging system always miss-classifies homonyms this is likely to make up only .6% of all tagged words; this would not have a statistically significant impact on the attribution results the tests would produce.

And as I have said before, the tests for word-types are not a grammar check. It does not matter if the words are categorized in the proper categories that fits the intended meaning in the texts. These tests are mathematic sorters that group words into categories A, B, C... As long as all words in the dictionary are consistently grouped in only one of these categories, they are still being sorted accurately. The tendency of the author to use a given homonym is still tested, even if they always mean to use it as a verb, but it is being classified as a noun. If you realize that the words "verb" and "noun" are irrelevant for a mathematic equation that is counting words in category A versus words in category B. And so Analyze's system still precisely fits the needs of my experiment by sorting words consistently. As I have already said, I will research if there are any more precise programs out there before I start re-testing and expanding the corpus of my 18th century study in a few months. At that time, if I do try testing CLAWS, I will be sure to check its attribution results against Analyze, and will report my findings in my study.

I have "looked" at the frequency of other words beside the most frequent etc. I decided that those frequencies cannot be systematically compared in a simple system that gives accurate results. The top-6 words and letters do frequently tend to identify the underlying author by themselves without even having to perform any other tests. Comparing all words in any one text against all words in all 284+ texts in a corpus is statistically nonsensical, as there would be an infinite possibility of matches and non-matches.

288faktorovich
mayo 13, 2022, 3:49 pm

>285 Keeline: A better question for you is which of Galileo's books do you imagine were published by somebody other than Galileo? And how are you defining a "publisher"? The title-pages of 15-17th century books list the printer, bookseller and author; I do not recall seeing any that list a "publisher". To prove that somebody other than Galileo published his books, you would have to provide a document that says a printer/bookseller paid Galileo for selling his copyrights to them to be the exclusive publisher of a text. Galileo's patrons paying for his books to be printed does not mean the Medici published his books; they simply sponsored the publication and his research.

Taking a few examples from the page you cited. The publishing information is at the bottom of the pages under the illustrations.

"Venetiis, Apud Thomam Baglionum. MDCX. Superiorum Permussua, & Privilegio."

"In Firenze. Appresso Cosimo Giunti MDCXV."

"In Roma MDCXXIII. Appresso Giacomo Mascardi. F. Villamoeria Fecit."

Latin for:

"Venice, by Thomas Baglionum. 1610. With Superior Permission and Privilege." There have been no major studies of "Thomas Baglioni's" printing house.

"In Florence. Press "presso" means "press" in Latin, while "appresso" is "near" in Italian, so this might have been a deliberate typo Cosimo Giunti 1615." Info on these printers: https://en.wikipedia.org/wiki/Giunti_(printers) - there is no mention of Galileo, and this might be connected with the strange "appresso" typo.

"In Rome 1623. Press: Giacomo Mascardi. Printed by: F. Villamoeria." Mascardi: printer: https://library.brown.edu/projects/rome/people/0151/ "Villamoeria" does not seem to appear elsewhere.

Where in this information do you notice anything about a publisher? Only the Giuntis have been vaguely called publishers, though they were professional printers.

289paradoxosalpha
Editado: mayo 13, 2022, 4:29 pm

>277 thorold: I don’t think the modern distinction between publishers and printers existed in his (i.e. Galileo's) day
>288 faktorovich: Where in this information do you notice anything about a publisher? Only the Giuntis have been vaguely called publishers, though they were professional printers.

I find it easier to credit Thorold's premise that "publishers" emerged as a reified function out of the class of printers ("presses" and "imprints"), rather than Faktorovich's that the work of publishers could be subsumed in the identities of authors. The latter seems like an anachronistic projection of 20th and 21st-century "self-publishing." Even the Giunti instance cited by Faktorovich supports the former notion.

290lilithcat
Editado: mayo 13, 2022, 5:28 pm

while "appresso" is "near" in Italian, so this might have been a deliberate typo Cosimo Giunti

It's not a typo, deliberate or otherwise.

"Appresso" in this context means "at the house of", "the press of": https://maltamapsociety.mt/glossary/terms-for-publisher/

See, for example:

https://lcweb2.loc.gov/diglib/media/loc.natlib.ihas.200196602/0001.tif/1719#h=12...
https://galileo.ou.edu/sites/default/files/G-10-Gal-1613-1.jpg
http://sites.nd.edu/rbsc/files/2018/02/BOO_004711008-001r.jpg
https://thumbs.worthpoint.com/zoom/images4/1/1015/13/1559-del-cortegiano-del-con...
https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Caccini_-_le_nuove_mus...
https://dobianchi.files.wordpress.com/2022/01/agostino-gallo.jpg
https://www.researchgate.net/profile/Annibale-Mottana/publication/284927487/figu...
https://c8.alamy.com/comp/K4N45B/peri-jacopo-cover-of-score-for-opera-euridice-1...

You get the idea.

291faktorovich
mayo 13, 2022, 8:41 pm

>289 paradoxosalpha: I did not state that "the work of publishers could be subsumed in the identities of authors". I stated that the term "publisher" is misused (anachronistically) in modern times when it is applied, especially to the first two centuries of print. The current use of the term "self-published" as a pejorative is contrary to the long history of the best authors paying to print or finding patrons to print their work. If you are going to argue that "Giunti" supports an anti-self-publishing position, you really have to clarify what you are trying to say, as the Giuntis were printers who mostly printed books people paid them to print.

292faktorovich
mayo 13, 2022, 8:49 pm

>290 lilithcat: If you had read my full explanation, I stated ""presso" means "press" in Latin". I translate it as "press" in my translations. It does not mean "the press of"; this phrase in Italian would be, "la tipografia di"; if it is a contraction, it would be more like "lapresso" or "latipografia", and not "Appresso". While the cite you refer to claims "Appresso" means "the press of" etc. in Italian, they are wrong, as I have explained. "appresso" is "near" in Italian. An English reader of one of these books might guess "Appresso" means "press" without checking in what language this word has this meaning. Latin and Italian were frequently mixed together in books from this period, which is why a foreign-speaking publisher might have made this typo consistently on many books where this word was repeated to signify "press". It is amazing though that you guys are committed to finding errors in my statements, even if you have to create an avalanche of errors of your own to do so.

293lilithcat
Editado: mayo 13, 2022, 11:09 pm

>292 faktorovich:

Oh, right, I forgot. You never make mistakes. Obviously, all those early modern Italian printers (including a slew of others to whom I did not bother to link) didn't know their own language, or all made the same typo, or were part of the same deep-dyed conspiracy.

I don't suppose it's occurred to you that language changes over time.

294prosfilaes
mayo 13, 2022, 11:27 pm

>282 faktorovich: The planets or the texts are entirely unchanged by my re-attribution. Most of the Renaissance texts, as I have said, were originally anonymous, and have been miss-attributed by linguists across the previous centuries. So if you think I am moving the locations of the planets, and you are only happy with the original bylines;

The point of the planets is that we know where they are. You're the one who chose a corpus where most of the texts were originally anonymous, and chose to devise your own tools to handle it. You're the one who has also refused to apply those tools to a corpus that is well-known.

>286 faktorovich: This discussion here on LibraryThing is one of these obstacles that I am facing as an inventor. Those who use a rival, widely-accepted method are working to insult my method and to silence my study.

Paranoia is such a good look, isn't it. At least one of the things I think that make academics more successful is that they have, from day one, worked with the idea that they have to make their case to other people.

>287 faktorovich: These tests are mathematic sorters that group words into categories A, B, C... As long as all words in the dictionary are consistently grouped in only one of these categories, they are still being sorted accurately.

Do you have tests to prove it? Have you tried randomizing the words that go into categories, if it doesn't matter? (I know you're going to take umbrage, but that is exactly what I would ask any scholar who made that assertion.)

>287 faktorovich: Comparing all words in any one text against all words in all 284+ texts in a corpus is statistically nonsensical, as there would be an infinite possibility of matches and non-matches.

No. Just no. There's nothing infinite about it.

295faktorovich
mayo 14, 2022, 1:26 pm

>293 lilithcat: Language changes, but "Appresso" does not register in modern dictionaries as an archaic Italian word, but rather as a modern word that means "near". I did not say there was a "conspiracy" behind the term "Appresso" whenever it was used on title-pages, but merely that based on my other research, its use is suspicious. I have found several suspicious wordings on title-pages including "Nashe His Ghost", and radical books claimed to be printed in cities where no other books were printed in the neighboring decades; the implications of such suspicious wordings take a while to figure out, and I only glanced at "Appresso", tagging it as a good research project, but one that I was not going to explore further. If you assume a suspicion is enough to establish a conspiracy, the exaggeration is yours.

296faktorovich
mayo 14, 2022, 1:43 pm

>294 prosfilaes: Most of all of the texts that were printed in the Renaissance were anonymous. The 284 texts corpus I chose was less anonymous as a percentage of the corpus than the percentage of the entire pool of all texts published in those decades, because my goal was to figure out which of the stated bylines belonged to the underlying ghostwriters, and that cannot be done without testing at least 104 stated bylines.

And aside for picking texts with different bylines, the rest of the corpus is mostly texts that have been studied by previous attribution researchers, and thus are known mystery-texts. I studied most of the "well-known" texts that have been the subject of debate, such as "Shakespeare" and "Marlowe".

One of the most overused tools used to disregard a rival theory is to call the theorist "paranoid". I have made the case "to other people", and I have provided overwhelming evidence. A sign that your insults are malicious and designed to discredit an accurate finding is that you have given up on finding any specific things wrong with it, and have now just devolved into name-calling.

Have I tried creating a program that would divide a dictionary into 7 categories other than "noun", "verb" etc. and testing if it would still give accurate attribution results? No, as I have said, I have not yet created any new computer program, as I am relying on the fact that programmers have already created plenty of programs that fit the task. There are also no equivalent categories to parts-of-speech that would divide a dictionary into 7 groups. I have tested the same texts with 20 other tests, and those tests are the alternative measures that reach the same results as the 7 parts-of-speech tests. That's why I am using 27 different tests, so if there are any glitches with any single test or a few of the tests, the other tests still steer a texts to be similar to the texts it is most proximate to.

My corpus has 284 texts with 7.8 million words. A comparison of all of these texts against all others on the frequency of each of the words would be approximately = 284 * 283 * 7.8 million = 629 billion comparisons. If you try imagining the size of this 3D table that would fit all of these comparison scores on each of these words for each of the texts, you will notice why the idea of measuring all words in all texts for frequency is an absurd idea.

297prosfilaes
Editado: mayo 15, 2022, 9:11 pm

>296 faktorovich: Most of all of the texts that were printed in the Renaissance were anonymous.

Then forget the Renaissance!

One of the most overused tools used to disregard a rival theory is to call the theorist "paranoid".

Not really. It tends to go against people who believe that people who disagree with them are only doing so because "Those who use a rival, widely-accepted method are working to insult my method and to silence my study."

I have not yet created any new computer program, as I am relying on the fact that programmers have already created plenty of programs that fit the task. ... That's why I am using 27 different tests, so if there are any glitches with any single test or a few of the tests, the other tests still steer a texts to be similar to the texts it is most proximate to.

So your system is hacked together from a number of publicly available tests, unrefined and unoptimized.

My corpus has 284 texts with 7.8 million words. A comparison of all of these texts against all others on the frequency of each of the words would be approximately = 284 * 283 * 7.8 million = 629 billion comparisons. If you try imagining the size of this 3D table that would fit all of these comparison scores on each of these words for each of the texts, you will notice why the idea of measuring all words in all texts for frequency is an absurd idea.

Have you ever heard of a computer? This doesn't even get in to the range of big data, much less impossible. A mid-size system would grind through it in minutes.

298prosfilaes
mayo 15, 2022, 10:05 pm

>296 faktorovich: you have given up on finding any specific things wrong with it,

* The method is untested, and you've show disdain for running it against a known set of works to see if you would get the expected answer.

* Your unfamiliarity with other tools, and unwillingness to learn them, brings into question the quality of the methods.

* Your inability to handcraft your own tools and choice of online webpages instead of the tools professionals swear by is also problematic.

* The method is producing extraordinary results, of works normally thought to be by many authors being by just six.

* This extraordinary result seems to be caused by how you tuned the method to detect two works being by the same author; if that's turned to extremes, naturally it will produce a very small number of authors.

* You rewrite known history, like the location of Vergestan and the authorship of the King James Bible.

* You chose a section of poetry you claimed was obviously by the same author that wasn't clearly by the same author, and you backed away when challenged.

* You're offering seriously extraordinary results, like one exiled author writing all of England's religious works on all sides.

* Your answer is simplistic; there's six ghost writers, period. There's no one-book authors, Queen Elizabeth and King James were ghostwritten, as were all the clergy and all the judges.

* The Holy Ghost, part of the Trinity, can't write. I'm not a believer, and it's hard to get accurate counts of religious beliefs in places where blasphemy is still a crime, but most of the authors and readers would presumably have been religious, and claims of the Holy Ghost writing through an author would be taken straightforwardly.

Hopefully this is enough; not that I expect you to quit, but maybe I can.

299faktorovich
mayo 16, 2022, 9:10 am

>297 prosfilaes: There is no rational reason other than a self-interested investment in a rival attribution of the Renaissance for anybody to disagree that the attribution method I have described is the best option of those that have been presented so far in this field.

My system is the product of years of research, refining and optimization. Yes, it does rely on publicly available tests, so that members of the public can use this method on their own without needing to consult me in the use of specialized programming.

This calculation is not impossible because there are too many data points, but rather because there is a near-infinite number of combinations for which texts match on which words. For example, the word "impossible" can occur once in 1 text, or a varied amount of times in all texts, or a varied amount of times in half of the texts. And the same would be the case for all of the other words. All of the texts are going to have at least a set of the most common words in common. All of them will also have at least one word that does not appear in any other texts. And then in the middle of these two extremes there will be variations of counts for each of the words. The standard Stylo process claims to count word-frequency, but what level of frequency is relevant (is there a minimum number of appearances?) changes between studies, and how these frequencies are weighed in significance is never clearly explained. Comparing all words in all texts in frequency against each other is like comparing full medical histories for 284 patients to figure out which of them are related to each other; you might hit on a few obvious cases where folks have hereditary diseases, but most of the data will be useless noise. If you are sure there is any rational basis for this word-frequency for all words comparison, try posting the raw data for all comparisons and explain every step how you (or Stylo: the steps the program takes) arrived at the attributions for at least 10 texts to start (if you have not attempted to do it on a larger corpus).

300faktorovich
mayo 16, 2022, 9:42 am

>298 prosfilaes: You are repeating the same false statements. I will yet again correct you. Yes, my method is tested, and I have tested it against works with firm bylines. There are no bylines that are so firmly established that a miss-match to one of them proves a method wrong. What is important is that the method finds results that are consistent across the corpus, so that the same style is consistently identified and external evidence confirms all conclusions. This is the case with my method and findings, as I prove across BRRAM.

Your "other tools" are faulty or have bugs in them and are unusable. I tested all of the major tools you have been advertising across this thread to come to this conclusion. I am not going to re-invent your tools to prove that my better tools already work as-is.

If there is a 3D printer that already exists, it is a better idea to use this 3D printer to print out an original shape for a new invention than to invent one's own 3D printer before printing the shape one is contemplating. Once something is invented, it is absurd to invest effort into re-inventing it just to gain credit or traffic.

The results are "extraordinary", and it is also extraordinary that all of you are responding to them with insults instead of reading the evidence that firmly substantiates them.

The "extraordinary" nature of my findings is the result of the extraordinary volume of labor I put into perfecting and expanding the argument into 17+ evidentiary volumes. When I and Petroglyph tested the same number of texts and bylines, my results were mostly similar to his, with only some evidence of co-writing, and the deletion of the "Emily Bronte"-byline as a pseudonym or part of the same authorial-signature as "Charlotte Bronte". If my method was designed to shrink the number of attributions, I would have ended up with only a couple of ghostwriters in this corpus. My method works and derives the precise number of actual authorial signatures in a corpus of any size from 10 to at least 284.

Yes, "history" is regularly re-written as scholars discover archeological digs, or find new evidence, or new ancient manuscripts. There is nothing innately problematic about re-writing history, and I provide historic evidence outside of my computational-linguistic data for why the current history is a fictitious falsehood.

I have no idea to what poetry you are referring to. If I tested this poetry and made the data available, the attributions I made are the correct ones. I have not "backed away" at any point in this thread; I challenge you to prove that I did.

Yes, an "exiled" author wrote most of England's theological works on all sides, such as for and against prosecuting witches. In a book I reviewed this past week I learned that Verstegan used a similar strategy when he wrote anonymously for and against draining the fens to establish the side against draining was using threatening language and needed to be silenced. The draining included confiscating common lands from the people who lived on them, and Verstegan was hired to make it look like a debate had taken place before the paying side won.

I tested 284 texts and reviewed many others and I did not spot any unique texts between 1570s and 1640s that could have possibly been written by anybody other than the 6 ghostwriters. This is not a "simple" answer, as I support this claim by explaining that these ghostwriters directly manipulated this market. For example, Byrd held a poetry/music monopoly with the state. In a largely illiterate society, why would the monarch be literate, when the monarch has a right to a throne from birth or close to it, and so he or she has power without having to invest any energy or intelligence to gain it.

If your side of the argument depends on the Holy Ghost writing; present proof of this Holy Ghost at work and then you are starting to support this position.

You are hoping "maybe" you can make me quit? Quit what? I am in the process of translating a 400-page book (Verstegan's "Restitution") during the surrounding months. My computational-linguistic method has already been applied and already solved the Renaissance attribution problem. I proved this case in Volumes 1-2. You have not even read more than a few paragraphs from this study. I am not going to quit working. This is year 17 of me working non-stop on my research. The question is really why are you here interrupting my work by repeating nonsensical babble that you have previously stated and I already replied to?

301anglemark
mayo 16, 2022, 12:24 pm

>298 prosfilaes:

You chose a section of poetry you claimed was obviously by the same author that wasn't clearly by the same author, and you backed away when challenged.

Yes, there are only two kinds of reactions when she is challenged: either she backs away and claims that she never said whatever it was (as if the rest of us can't read – those ridiculous claims about "appresso", for instance), or she simply ignores the explanations that could have helped her understand what it is she is doing, and goes on to claim that nobody has been able to challenge her methods, or results, or definitions.

>294 prosfilaes: Re your question about >287 faktorovich:, where she said "the tests for word-types are not a grammar check. It does not matter if the words are categorized in the proper categories that fits the intended meaning in the texts. These tests are mathematic sorters that group words into categories A, B, C... As long as all words in the dictionary are consistently grouped in only one of these categories, they are still being sorted accurately." asking her about it is not going to be meaningful, since she has fundamentally misunderstood a few things. She is trying to justify the conflicting facts that a) part of her "attribution tests" has to do with how frequently an author uses nouns, adjectives, etc, and b) that the tool she has opted to use miscategorises a lot of words (and not in a consistent manner), so she can't actually say anything about the frequencies of nouns, adjectives, etc in her tests. "all words in the dictionary are consistently grouped" is pretty much gobbledegook. It's anybody's guess what she means by "dictionary" since dictionary definitions are not what's relevant here, and the whole problem with using the AMW tool is that is doesn't analyse the parts of speech consistently. As has been explained to her a couple of times. (What I don't get is why she doesn't use WordSmith, a pretty powerful tool which she has said that she uses to create word lists, to analyse the parts of speech. But it probably doesn't create colourful bar graphs.)

But to try to answer your question, if by "the tests for word-types" she meant part-of-speech tagging, any sophisticated automated tagger is going to use a combination of the parts of speech associated with a word in a dictionary, and syntactic analysis. For example, if the POS tagging software finds a phrase where a definite article is followed by a word that could be a verb or a noun depending on context, it is probably going to guess that the word is a noun in that position.

-Linnéa

302Matke
Editado: mayo 16, 2022, 3:33 pm

>300 faktorovich: I believe the reference was to quitting this thread; that is, no longer posting responses here.

Your consistent lack of understanding of what is quite plain writing is unsettling, since you claim never to have made a false statement or, apparently, been just plain wrong in any statement you have made.

That sort of thing (refusing to admit even the possibility of error on your part) really reflects poorly on your status as an unbiased researcher who has covered information that overturns a huge part of literary history. You don’t seem to grasp the fact that one who makes that sort of claim must provide proof. Your stating that you’ve spent hundreds (thousands?) of hours doing the research and proving your predetermined theory is not proof. Nor is 17 or 27 or 100 volumes proof that what you claim is true. Your conclusions don’t turn out in other programs with long histories of scholarly work; therefore those programs are in error, unusable, full of glitches.

If your theory is indeed the correct one, you should be able to duplicate the results across the board using any of a number of existing research tools. The fact that you can’t doesn’t mean that all the tools are wrong; it means that your conclusions were arrived at erroneously, or prematurely.

No one is trying to force you to stop. The only interest here is to have something besides your self-designed systems as proof of your ideas.

Please do let us know as soon as you have such proof and evidence.

Edited to add the post # to which I am responding.

303anglemark
mayo 16, 2022, 12:30 pm

>300 faktorovich:

I have no idea to what poetry you are referring to.

Here: post 561 in the previous thread. Conclusively shown to be nonsensical, in the following posts – and here in this thread there is even more evidence that your claims in post 563 are based on incorrect assumptions.

If I tested this poetry and made the data available, the attributions I made are the correct ones.

If you imagine for a second that your attributions are not correct, what might be an alternative explanation for your results?

-Linnéa

304lilithcat
Editado: mayo 16, 2022, 2:32 pm

>300 faktorovich:

In a largely illiterate society, why would the monarch be literate, when the monarch has a right to a throne from birth or close to it, and so he or she has power without having to invest any energy or intelligence to gain it.

Why?

Well, first of all, many monarchs of the early modern period would NOT have been expected to achieve the throne. Henry VII got there by overthrowing a previous monarch. Henry VIII got there because his older brother died. Neither Queen Mary nor Queen Elizabeth were assumed to be heirs; Henry kept trying for a son and got one. James VI and I could have been expected to become king of Scotland, but his son, Charles I, also became king due the death of the heir presumptive.

More importantly, any reasonable monarch would want to be literate. Being king was a treacherous business, and literacy would be an important tool in keeping an eye on, and control of, your advisors and others who might be a threat. So even assuming, arguendo, that you didn't need to "invest any energy or intelligence" to gain power, you did to keep it.

And, of course, the part of society that was "largely illiterate" was not the part to which the monarch belonged. Elites were highly literate, and not merely in English. We know that Elizabeth I, for example, was multi-lingual. Edward VI also studied multiple languages, as did James VI and I.

305Keeline
Editado: mayo 22, 2022, 5:07 pm

>296 faktorovich:

My corpus has 284 texts with 7.8 million words. A comparison of all of these texts against all others on the frequency of each of the words would be approximately = 284 * 283 * 7.8 million = 629 billion comparisons. If you try imagining the size of this 3D table that would fit all of these comparison scores on each of these words for each of the texts, you will notice why the idea of measuring all words in all texts for frequency is an absurd idea.

Other than showing you can use a calculator to multiply some large numbers to get even bigger numbers, what is the purpose of this? What does it really mean?

For example, if you had a text of 46,418 words (a real example in my corpus) and overall there are 1,607,892 in 39 works, I don't have to compare 1.6 million words. I need to know which words are unique to each text. In this case for the one story it is 3,909 words for the longest text, not 46,418.

Text file manipulation has been refined for more than 50 years in Unix type system (yes, since 1970). There are many tools of varying levels of sophistication and complexity that can be used, one after another in a pipeline, to get answers.

Let's say you have files with these names:

% ls
TS01	TS04	TS07	TS10	TS13	TS16	TS19	TS22	TS25	TS28	TS31	TS34	TS38
TS02	TS05	TS08	TS11	TS14	TS17	TS20	TS23	TS26	TS29	TS32	TS35	TS39
TS03	TS06	TS09	TS12	TS15	TS18	TS21	TS24	TS27	TS30	TS33	TS36	TS40

and you want to know how many words are in each (and sort them from largest to smallest for the illustration):

% wc -w * | sort -rn
 46418 TS34
 45695 TS35
 45556 TS22
 45208 TS36
 44939 TS23
 44823 TS19
 44517 TS29
 43945 TS13
 43763 TS18
 43742 TS31
 43460 TS11
 43360 TS28
 43265 TS38
 43121 TS04
 43112 TS20
 43088 TS33
 42935 TS30
 42665 TS02
 42651 TS03
 42436 TS16
 42369 TS24
 41892 TS05
 41723 TS26
 41706 TS01
 41690 TS27
 41564 TS08
 41526 TS14
 41482 TS12
 41402 TS25
 41337 TS32
 41308 TS10
 41135 TS07
 41041 TS21
 40986 TS15
 40442 TS09
 40411 TS06
 40310 TS17
 14469 TS40
 12400 TS39

 1607892 total

But now you want to know how many unique words in one of those files. I'll pick the largest and store the result in another directory to avoid too much clutter:

cat TS34 | tr -cd "\:alpha:\\:space:\-'" |
 tr ' \:upper:\' '\n\:lower:\' |
 tr -s '\n' |
 sed "s/^\'-\*//;s/\'-\$//" |
 sort |
 uniq -c > ../ts_uniq/TS34u

This puts all of the unique words and counts in a file (uniq -c). It is alphabetical by the word because of the sort command. I manually set the destination file path and name but this could be done with variables but I did not want to give too complex an example. There are leading spaces that LT Talk won't show but the first few lines look like this:

% head TS34u
1020 a
 1 aback
 4 ability
 11 able
 9 aboard
 158 about
 12 above
 1 absent
 1 absently
 1 absolutely

But maybe it is more helpful to have a sorted list by the frequency of a word in the text:

% sort -rn TS34u > TS34us

That new file TS34us has these for the top lines:

% head TS34us
2721 the
1454 to
1060 and
1020 a
 998 of
 831 tom
 815 i
 721 in
 667 he
 635 was

and these for the bottom lines:

 1 accommodations
 1 accommodated
 1 accommodate
 1 accidentrear
 1 accessories
 1 accepting
 1 absolutely
 1 absently
 1 absent
 1 aback

I see that Tom is used 831 times in this story and is the 6th most common word in this work. It is context-specific and these works routinely give the name of the character when referring to them. It is part of the style for this genre. The other words are ones that are common in English.

Looking at the bottom lines I see something unexpected accidentrear which is not a word but is probably a case of a typo in the source text. So I can look to have it tell me where the word appears in the file in my corpus directory:

% grep "accident—rear glider" TS34
accident—rear glider cut loose and is falling—hope they make a safe

I did not find it with "accidentrear". By looking at "accident" I found all of the occurrences of the word (52) and could spot the troublesome line. In this case, the Gutenberg typesetters put an em-dash that was not one of my characters for separating the words. I can edit the source text or change the lines of code to produce the lists of unique words.

Let's consider aback and I want to know how common that word is used in the other texts of the corpus:

% grep aback *
TS32:“Oh, did you?” Tom was rather taken aback by this reply.
TS32:“CUNNINGHAM!” gasped Tom Swift, taken aback as much by the boldness of
TS34:This seemed to take the angry man aback. He swallowed his feelings with

So it was used twice in TS32 and only once in the longest text TS34.

For a list of the file names and number of occurrences of aback I could:

% ls | xargs grep -c aback | grep -v ':0'
TS32:2
TS34:1

I could write a small loop to take the 100 (or 1,000) most common words and find out which works have those words.

The investigator needs to decide what questions to ask and learn how to get the results. After performing some sanity checks, you then have to be sure they are meaningful and can't occur from other things that are unrelated to the question being asked.

Does the weather in Buffalo, NY influence the weather in San Diego, CA? Perhaps. A little. But generally you want to focus on weather elements within 100 miles and the larger-scale systems like storms from the Gulf of Alaska that are carried south along the Jet Stream. Solar activity, including sunspots as an indicator of same are factors but probably not "Mercury in retrograde." Just because you can measure something does not mean it is significant for the question at hand.

Using the right tool for the job is important. A screwdriver might be used as a hammer but it is not as good as using an actual hammer. The complaints raised about your free website tools come down to whether they can do the job they claim to do — accurately identify and count certain parts of speech, punctuation, etc. If they have problems, anything that relies on them also falls apart.

Based on your posts in this two-part thread about your experience with Stylo for R, I don't for a moment believe that you used it enough to determine that it is filled with bugs as you claim. It requires spending more than 30 to 120 minutes before giving up and declaring it a failure. If you wrote a novel with Microsoft Word and did not manage to sell it to a publisher, is it the fault of Word?

James

306Keeline
mayo 16, 2022, 2:57 pm

As I have written before in this two-part thread, in 2002-2003 I wrote a PHP program to implement a stylometric system that showed promise. In my limited tests, the system did seem to work for texts with well-established extrinsic evidence (contracts, correspondence, etc. from the private business records for the Stratemeyer Syndicate now held at NYPL.).

But one of the disturbing things that I (and others) raised at the time was a theoretical basis to explain why the 10 or so tests used by the creators of the system would be meaningful to authorship attributions. For example, one might make a case that shorter words (2 or 3 letters, 23lw, but 4 letters in some tests, 234lw) could be mainly function words that help to join together the more contextual or lexical words in the content. But some of the prescribed tests counted these plus words that begin with vowel (ivw or initial vowel words). Why would the rate of use of "airship" (ivh) be counted the same as a word like "was"? The creators of the system who published a book on the topic and used it extensively, and with whom I corresponded some 20 years ago, could not offer an explanation or theoretical basis as to why these should matter. That caused me and others to have some doubts about the findings.

Inspired by this interesting thread, I decided to see what, if any, consensus there was on the QSUM system. Was it still in use? Was it respected? Was it admissible in evidence in British courts still as it had been in the 1990s and early 2000s? While I didn't find complete answers on all of these questions, I did find some critical (i.e. "negative") articles about QSUM and get a sense that someone interested in advancement in academia probably is not going to gain respect in their field by relying on QSUM.

This article, in particular, was written after my PCA presentation and it makes some careful and specific criticisms of QSUM which I find to be convincing. But the real reason to bring it up is what the author has to say about the importance of careful attributions since it is easier to make a claim to add a text to an author's canon than it is to remove it once it has begun to appear in major published editions. The last couple of pages offer some suggestions of ways to test any new method of authorship attribution that are relevant here.

https://www.jstor.org/stable/40372114?seq=1

As you know with JSTOR.org you can sign up for a free account, even with a Google account, and get permission to read (on screen) page images of 100 articles per month. Since most of the 24-page article focuses on the details of QSUM and the critique of it, I'll leave that to the reader who may be interested. It is not a complete presentation on the method so for that look to Analysing for Authorship by Jill M. Farringdon, et al. Instead, let's look at what is said on pp. 263 and 285-286:

Because attribution study lays foundations for others to build on, those who attribute authorship should be cautious. If and when these foundations crumble, the scholarly labor already performed is proven to be a waste of time and effort for all of us. Witness the energy expended to argue that Shakespeare wrote A Funerall Ellegye (1612): the attribution resulted in the poem's appearance in standard Shakespeare textbooks, from which it has begun to vanish now that some specialists believe that John Ford wrote the poem. But who knows how long it will take for this poem to become fully dissociated from Shakespeare. Perhaps never.

Can a new attribution method pass a "falsification test"? In Karian's example, if one selects texts believed to be of consistent authorship, if one text is removed from the corpus and tested as if it is an unknown, does it work? Repeat this process for multiple texts. Several other good suggestions follow. The final paragraph seems particularly important:

Finally, I would make a plea that proponents separate their arguments in support of the method from its application in particular cases. That is, proponents should first independently publish an account that describes the method and its limitations as well as how it responds to known and accepted cases of authorship. This step should give the scholarly community an opportunity to examine tlie method and respond. Only after a reasonable time to allow for this exchange should proponents begin to offer attributions (or de-attributions) of particular examples. I hope that this delay would decrease irresponsible claims of authorship.

James

307faktorovich
mayo 16, 2022, 8:27 pm

>301 anglemark: An inconsistent manner of tagging would be if the word "explained" was not always categorized as the same part of speech; as long as each word in the dictionary is always tagged as the same part of speech whenever it appears, this is a consistent manner of tagging. Again, the goal of my method is the attribution of a text, and to "say anything" about the grammar of a text; the method is not a verbal description of the author's precise grammatical usage, but rather a quantitative system designed to check if the quantifiably linguistic measures are similar or different between the texts in the corpus.

The free version of WordSmith that I have has limited access to the features of WordSmith. I would not want users of my method to have to purchase a WordSmith version to use my method so I cannot switch to a paid service. The only tutorial I could find on how to perform parts of speech tagging with WordSmith is this unrelated page: https://lexically.net/wordsmith/version4/screenshots/index.html?concordancingtag... You guys have not come up with a simple step to apply my method in CLAWS without adding a program, so I am assuming it would take creating a program and paying for software to tag with WordSmith as well. The character counts is the only measure that is automatically accessible in the free version. There were glitches with the other tools I experiment with in the WordSmith program when I first considered which of their tools would be best for utilization.

If modern tagging use word-order to determine part-of-speech, they are likely to make mistakes with the Early Modern English texts because they use different rules of order, including reversals of verb and noun order.

308faktorovich
mayo 16, 2022, 8:40 pm

>302 Matke: Everything you are saying is either nonsense or points I have addressed before. I have proven my conclusion beyond all reasonable doubt. There is no rational way to argue with willful ignorance.

309faktorovich
mayo 16, 2022, 8:42 pm

>303 anglemark: My attributions are correct, and I made a detailed explanation in post 563 to explain them; your hypothetical question is irrelevant.

310faktorovich
mayo 16, 2022, 8:49 pm

>304 lilithcat: Yes, that's why James I had books ghostwritten for him to increase his chance of attaining the throne when he was not next in line, and also to unify the Scottish and English kingdoms. Being king is only "dangerous" for somebody who has read the histories of past tragedies that have befallen kings. There is bliss in ignorance, so an illiterate king/queen would not be aware there are threats they could prevent by reading about how their predecessors died. I do not know why you are using the royal "We". Are there many of you writing in your account? Are you a monarch? Or are you suggesting this is an established fact? The claim that Elizabeth was "multi-lingual" was made in the Workshop's contemporary pufferies of her intelligence, for which she paid them pensions and grants; she paid at least one of them, Byrd, directly. How about if I write this puffery in the third person about myself under a different byline, would you believe it to be a fact?

"Anna Faktorovich is the most brilliant linguist that has ever lived; she has studied every language on the planet, and speaks 550 of them fluently." --Dr. SmartestMan OnEarth

311faktorovich
mayo 16, 2022, 9:22 pm

>305 Keeline: "I need to know which words are unique to each text." Do you understand what you are saying with this statement? The words that are "unique" to each text are those that do not appear in any of the other texts. "Unique" means: "being the only one of its kind; unlike anything else." These single-appearance words cannot be used for attribution because they always only appear in a single text, and thus there are no texts in the corpus that can be proven to be similar if one only considers these single-occurrence words.

You have to verbalize the steps you are taking. It looks like then, you are calculating the percentage of the "head" words? "In English grammar, a head is the key word that determines the nature of a phrase (in contrast to any modifiers or determiners)." There is no logical reason nor is there a tagger that could accurately only pick out the head-words in this sense, so you must be trying to do something.

It is clear that you are not counting "unique" words, as you are providing what looks like the beginning of a table of the count of the frequency of appearance of all words in the text, from the likely most-frequent-word "a", to words that only appears once like "aback". If you had only counted "unique" words in each text, there is no chance your list would have included "a", as it is present in all texts I have ever read in the English language.

Then, you correct this list by ordering it in the manner that is relevant for my 6 most-frequent words, with the 6 most frequent words at the top. And you now have the likely to be most unique single-occurrence words at the bottom of this list.

Okay, then you finally arrive at the actual comparative method you are using. However, you are not giving the steps to compare every single word in the corpus against each other, but rather focusing on analyzing an individual word and its usage in only 3 texts. If you have to manually compare millions of words in a large corpus like this, you would have a list of billions of similarities and non-similarities for individual words that would not actually establish any signature conclusions.

Now you have gotten to the programmable part of your method. "I could write a small loop to take the 100 (or 1,000) most common words and find out which works have those words." First, there is a big difference between 100 and 1,000 of the most-common words as if you approach 1000 * 284 * 283 you are slipping into over 80 million comparisons, or 8 million for 100. Secondly, you are going to end up with a nonsensical number of matches and non-matches for each word. Imagine that each match is a string going between a 3D box of rows and columns with as many as 80 million strings (if all texts are identical), or at least a million strings if there are many unique words that only appear once in a text. Creating this crisscrossing box of strings with a program is easy enough, but how are you going to figure out if a match between texts on 50 words with varying frequencies between 30 and 1000 is more indicative of an authorial match than a match on 100 words with frequencies between 1 and 50, or any of the other myriad of possible combinations? You are missing this essential step in your process, without which you are just finding a chaotic set of matches and non-matches, and not the answer to who-wrote-what. But instead of answering this basic question, you digressed across the rest of your response into philosophizing on why you are the smartest-human-of-them-all.

312Stevil2001
mayo 16, 2022, 9:37 pm

Doctor Faktorovich, back in >59 faktorovich: you said, "And it is absolutely absurd to demand for a computational-linguistic method to be 'easy'." The tone of that post was that you did not want the method to be reproducible. Yet in >307 faktorovich: you claim you do want the method to be so easy anyone can use it. Which one is it?

313faktorovich
mayo 16, 2022, 9:39 pm

>306 Keeline: My method accurately attributed the Renaissance texts because I compared the largest corpus ever attempted from this period of 284 different texts with 104 different bylines. As I have explained before, the "Funeral Elegy" and other previous re-attributors of individual texts have compared them to only a few other bylines or texts, and thus whenever they change their corpus to a new set of a few different texts with a few different bylines, they reach a different conclusion. The existence of only 6 ghostwriters during these decades explains why this is the case, as it is easy to pick any random set of 10 texts in the same genre and at least one of them will match the mystery-text you are trying to attribute because with only six ghostwriters probability is in the favor of at least one match. Especially when comparing dramas, Percy and Jonson wrote significantly more of the famous (mystery) dramas than the other ghostwriters, so it has been easy for attributors to ping-pong those texts between bylines (especially if neither Percy nor Jonson themselves are in the corpus, and the corpus is too small to notice, or for the researcher to care that there are other strange cross-byline matches).

It would be amazing if a precise description of an accurate method was all that was needed for scholars to look back and acknowledge they have miss-attributed the entire British Renaissance. But it seems that scholars in this field are too vain to consider any volume of evidence as sufficient to prove them to have been wrong. Previous re-attributions have indeed been irresponsible, and the 3-year delay since I first explained this method and submitted it to journals for review is too long for this review process to be based on any rational consideration for truth.

Regarding your JStor article, I solved the Defoe mystery in my 300,000-word 18th century book to which I will return to retest a much larger corpus after this project is finished.

314faktorovich
mayo 16, 2022, 9:43 pm

>312 Stevil2001: My method is reproducible and easy to use to check which texts are like and unlike each other. The difficult part, as I explained in post 59, is figuring out which of the thousands of possible bylines is the correct author when a corpus has a complex combination of multi-byline ghostwriters. Each byline's biography and the historic implications behind their potential authorship have to be weighed. This non-quantitative research stage takes years, whereas the computational-linguistic data processing takes a few days, weeks or months at-most.

315Matke
mayo 16, 2022, 9:46 pm

>308 faktorovich:
There is no rational way to argue with willful ignorance.

How true.

316prosfilaes
mayo 17, 2022, 12:46 am

>311 faktorovich: First, there is a big difference between 100 and 1,000 of the most-common words as if you approach 1000 * 284 * 283 you are slipping into over 80 million comparisons

It's Over 9000!

https://gist.github.com/Prosfilaes/b00d5d4c1164df147b4703d24dbae635 compares all words between all texts given on the command line. It's 50 lines of code banged up in a couple hours. To process 300 works with 24 million words (from Project Gutenberg) took 6 minutes, and there's obvious steps to make it much faster.

I don't recommend anyone use this code; it's a naive approach that produced mediocre results on a tighter test sample, marking The War of the Worlds as closer to The Gods of Mars than the First Men in the Moon, among other things. (It lacks pretty output, or even code to group or tree the results. Just raw distances.) It's just you're so confident about these things even when you're so wrong.

317andyl
mayo 17, 2022, 5:00 am

>311 faktorovich:

Oh and "crisscrossing box of strings" what a weird way of looking at it. You are just manipulating sets (in a CS sense) of tokens (words). It would be very easy to get a result tsv file with every word on a new line with columns consisting of the count for all the texts. This is simple.

Of course the questions you ask this data are important - you can ask sensible questions or nonsensical questions of the data. But producing that data from the raw texts is pretty quick and easy (even for all words). No manual steps are needed. The runtime will also be a lot quicker than you expect (see prosfilaes's runtime for their code). Obviously this piece of code could not possibly supply all the processed data you might possibly need (some questions require more than just word frequency comparisons amongst texts) in order to be able to write your magnum opus. But it is just as easy to put together more programs to produce the data for those questions.

You continue to be factually wrong about how feasible it is for a normal practitioner to be able to write these sort of programs. You seem to have a bad case of NIH (not invented here) syndrome and favour very manual systems over something which a computer program is much better at.

318anglemark
Editado: mayo 17, 2022, 6:47 am

>307 faktorovich:

An inconsistent manner of tagging would be if the word "explained" was not always categorized as the same part of speech

Yes, exactly! As pointed out above, (for instance) "you" is sometimes tagged as an auxiliary verb and sometimes as a pronoun. "Explained" is sometimes used as an adjective in the period you are looking at, even though it is nearly always a main verb, so it shouldn't always be tagged as the same POS. And if "explain'd" and "explained" are tagged differently you have an inconsistency there as well. Do you lemmatise your texts at all?

You guys have not come up with a simple step to apply my method in CLAWS without adding a program, so I am assuming it would take creating a program and paying for software to tag with WordSmith as well.

Nobody has claimed that CLAWS doesn't need a tool to display the relevant statistics. It's a tagger, which tokenizes a text and generates a text where each token is tagged for parts of speech. There are multiple free software tools that will generate the kind of statistics you are interested in, based on a POS tagged text, and you have been provided with links to websites that list such tools. I didn't know there was a free version of WordSmith – it's not a tool I have used for many years, myself, because as far as I'm aware it's Windows only – but I do know that even the older versions accept POS tagged texts. Maybe it doesn't generate all the output you'd like to have, but as I say, all you need to do is use another free tool of the many that are available to you.

>309 faktorovich: That is always a relevant question that every scientist and scholar needs to ask.

-Linnéa

319Petroglyph
mayo 17, 2022, 9:10 am

>301 anglemark:

I had a brief look at the usefulness of Parts of Speech for author attribution in this post, and tried both a detailed set of 50-ish POS tags (courtesy of TreeTagger), and the 7 that AMW uses. Conclusion: For small corpora with limited authors they're useful, and the more detailed the POS are, the better the results. But the more texts you add, and the more authors, the worse things get.

320lorax
mayo 17, 2022, 10:36 am

prosfilaes (#316):

It's not that the results of a word-frequency comparison are mediocre, it's that they're more geared toward finding topical similarity than authorial similarity. This type of approach is intended to be author-agnostic and topic-sensitive. Authorial idiosyncracies that persist across topics, things like sentence length or an inordinate fondness for semicolons, aren't going to come up well in this type of analysis.

(And distance-between-texts by overall word frequency analysis is so boringly standard that very early in this first thread it's what I assumed faktorovich was doing, just because it's the most obvious imaginable approach.)

More generally, one of the many problems with faktorovich's approach is that there is no ground truth, no test case. If you're trying to do topic modeling, you have some human-labeled texts, and see how well your method reproduces the labels, and tweak to optimize that. Or you have someone go back and look at what your model decides is about space travel, and sanity check it. But faktorovich has decided that there is no amount of historical documentation that would be adequate ground truth to check that her methodology correctly reproduces known authorship - no, no, it's the history that is wrong!

321lorax
mayo 17, 2022, 10:37 am

faktorovich (#300):

If there is a 3D printer that already exists, it is a better idea to use this 3D printer to print out an original shape for a new invention than to invent one's own 3D printer before printing the shape one is contemplating. Once something is invented, it is absurd to invest effort into re-inventing it just to gain credit or traffic.

This is what we're trying to tell you. But you're the one trying to invent your own.

322Keeline
Editado: mayo 17, 2022, 1:15 pm

>321 lorax:

Faktorovich wrote:

If there is a 3D printer that already exists, it is a better idea to use this 3D printer to print out an original shape for a new invention than to invent one's own 3D printer before printing the shape one is contemplating. Once something is invented, it is absurd to invest effort into re-inventing it just to gain credit or traffic.

No. There is an endless group of 3D printer models with small changes and refinements because none is really ideal and general purpose (suiting all needs). There are large communities of people designing and building these and producing plans, kits, or ready-made products. The latter are often the weakest of the bunch. Facebook and other places have large and extremely active forums to discuss the latest refinements and help people decide which 3D printer should be their first or next to buy.

Closer to our field because is it part of the process to create many of the electronic texts analyzed is the community of people developing and refining book scanners. One such group is https://diybookscanner.org/ where one can find designs and plans or kits or learn of commercial products. The software to help with various stages of the workflow is also extensively discussed.

Of course there are special interest groups to discuss practically anything. The Internet and platforms that use it draw people of like interest together. Where they once had to exchange ideas in printed magazines or annual gatherings, now they can visit a forum and see the latest things people are talking about.

James

323lorax
mayo 17, 2022, 11:00 am

Keeline (#322)

The part of the message that you have attributed to me is the part that I was quoting from faktorovich. Attributing it to me directly is extremely misleading. Please edit your message.

324faktorovich
mayo 17, 2022, 12:52 pm

>316 prosfilaes: You have an extremely slow computer if any computer process takes 6 minutes.

You are still not verbalizing what this program does. It can compare the word-frequencies for each word in each text to come up with a 0/1 of if a word is or is not in each of the texts. But then how does it figure out which texts are alike, since all texts will have some words in common and some words that are in none of the other texts, and many words with random degrees of frequency match/non-match or 0/1. If you do not know what a program you are pitching does, including the formula it is using and can vocalize the steps it is taking to carry out this formula; you do not understand why or how the system is giving you attribution conclusions, or if it is giving you anything to do with an attribution or a bunch of individual word comparisons. The "just raw distances" is precisely what I saw in the raw data when one of the top computer-linguists shared his raw data with me as we tested a set of texts to see if his process worked. When I began asking further questions, he cut off all communication. He also could not explain just how "raw distances" between 100 or 1000 different words in hundreds of texts can possibly establish a non-chaotic attribution conclusion.

325andyl
mayo 17, 2022, 1:05 pm

>324 faktorovich: You have an extremely slow computer if any computer process takes 6 minutes.

Hahaha. I've not had such a good laugh for ages. You have put your ignorance on show yet again.

326faktorovich
mayo 17, 2022, 1:06 pm

>317 andyl: If you are working on a math problem that has 3D data and you are not visualizing what the calculations you are attempting to apply is actually doing; you have a "weird way" of ignoring mathematical reality. If a solution you have come up with would look incorrect to a kindergartener, you are not on the right track. The correct solution is one you can explain to a child visually. If even a child looks at a ball of a million strings between 284 texts that represents matches at random frequencies and asks "what is that supposed to do?" you should re-think your method.

Yes it is simple to come up with a frequency count for all of the words. You are stuck on this point. The comparison step is where the problem comes in with this pop approach to word-frequency attribution.

I have nothing against computer programs. The program with your method is not that you use computer programs, but that the mathematic formula or comparison process you are telling your program to use is incorrect. The method I came up with is the correct formula for precise authorial attribution. I just have not (yet) created a program to make this process a one-step input-output answer. A program for the 27-tests would be a lot more complex than a single-test word-frequency comparison program that is measuring simple frequency. My word-frequency test requires far more steps of a potential program, as it has to calculate the 6 most frequent words, and then sort them into patterns of which words appear among these 6 and then compare these patterns between all of the texts. A full program for all of the steps would also have to clean up input texts of glitches such as lines over characters. And it has to create a full set of data tables to present the data in an easy to use way. So it would basically require the same amount of programming-time to make it usable as a website as a video game. There is no need for me to create such a program, since the manual method works to achieve the same result, and this program would be free to the public so it would be a charity-project that nobody has asked me for.

327anglemark
mayo 17, 2022, 1:09 pm

>325 andyl: I (Johan) will tell our developers this! We have to redesign our test suites and improve our source code so it doesn't take so long to compile.

328Keeline
mayo 17, 2022, 1:14 pm

>311 faktorovich:

"I need to know which words are unique to each text." Do you understand what you are saying with this statement? The words that are "unique" to each text are those that do not appear in any of the other texts. "Unique" means: "being the only one of its kind; unlike anything else." These single-appearance words cannot be used for attribution because they always only appear in a single text, and thus there are no texts in the corpus that can be proven to be similar if one only considers these single-occurrence words.

It is clear in your reply that it is you who are confused about what I am saying with this statement.

A single work may have 46,418 words but of these there are only 3,909 words.

When it comes to comparing words from one text to others, you need only look at the distinct (like that word better?) words from each text or work.

Notice that in my original quote that you are quibbling about, I wrote "unique to each text". I did not say anything about being unique to the entire corpus or an author's works.

If you are going to use a dictionary to come up with definitions, make sure it is appropriate for the topic at hand.

My purpose of that post was to show that processing text files to get interesting things is like following a recipe with several steps. Sure you might need to know where to get an ingredient, how to interpret the abbreviations for the unit of measurement, and how to add it to the process. You may also want to decide which ingredients to skip because perhaps you don't like or allergic to one of them or don't have it on hand. Of course, if you want anyone else to be able to replicate the modified recipe, you have to document the changes to produce a revised recipe.

I was not trying to create a full authorship attribution system and it is presumptuous for you to think that anyone could present one in the context of a single post reply. Instead it was more in the category of "here are some basics; it is not really as scary once you have an introduction."

If you read and correctly understand what I showed, you would see that the common word "a" was present 1,020 times in the longest text that I had. Dealing with the most common words is hardly a news flash. They are common for one or several reasons. They are usually connective words that bind content in English.

That new file TS34us has these for the top lines:

% head TS34us
2721 the
1454 to
1060 and
1020 a
 998 of
 831 tom
 815 i
 721 in
 667 he
 635 was

Other than Tom which is repeated because this example is a Tom Swift book, the other words are exceptionally common in all English fiction. It is hard to make a case that you can tell one author from another based on these words which are everywhere but in only slightly different ratios that may not be statistically significant.

I have seen more studies that focus on words that are not so common but might be favorite or "pet words" for an author.

Returning to "unique," the particular command I am using at one stage is called uniq which is used to take an alphabetical list of words (or any strings of characters in a line) and remove duplicates. When I add the option -c then it will count the number of times that the word was seen.

After I have a list of counts for the word and the word itself on separate lines, I can sort that list based on the number of appearances in a text (file) and see which are the most common used.

There is a catch phrase that is heard in our circles:

English major, you do the math.

A loop barely qualifies as a program. It is basically repeating a task many times (something computers are good at). There aren't even any decisions (if ... then ... else) being made.

James

329faktorovich
mayo 17, 2022, 1:16 pm

>318 anglemark: It would be impossible for a program that does not distinguish between homonyms to also distinguish between them. In other words, either a program has a system that always checks the word-order around "quail" (noun: the bird) and "quail" (verb: to cringe), and then categorizes them differently; or it never checks this order and always categorizes "quail" as either a noun or a verb according to its pre-set categorization for this word in its dictionary. And it would have to be some kind of a hacked glitch for a program to categorize "you" as an "auxiliary verb" unless it is misspelled as a version that is not in the program's dictionary; but "you" spelled as-is is not a homonym, so it cannot be categorized as anything other than a pronoun. As I explained, "explain'd" would be miss-categorized by CLAWS because this system interprets all 'd instances as if they are I'd or a verb such as "would". This glitch does not really come up in Analyze, which does not categorize any cases of 'd as a verb.

Again, the programs I chose to use are the ones I recommend to other users who do not have time to create programs of their own to assist programmers who were too lazy to create output that has added up the frequency of each of the applied tags.

330Keeline
Editado: mayo 17, 2022, 1:30 pm

>329 faktorovich:

You seem to be assuming that the decision of POS (parts of speech) is determined solely by looking at the word and tossing into a "bucket."

Simple systems might do that. They are also not very good.

The reason that a more sophisticated system works is that it has more than a word list with the most popular POS classification attached. But instead it looks at the whole sentence, the words around the word, to determine the context.

This is more challenged when you are using lots of words that are not found in or used the same in modern English. The systems we are discussing for this purpose and especially AnalyzeMyWriting.com expect modern English and can sometimes have application with other languages and earlier versions of English which is close to having a different language or at least a dialect.

First and foremost, the readability indices used by this website were developed with English in mind. Thus, any readability results for a non-English text will not be valid. However, analyzemywriting.com does support characters from several non-English alphabets, so character counts, word frequencies, and so forth should work well.

Secondly, these indices assume that a text is written in grammatically correct, properly punctuated English. If this requirement is not met and you feed a nonsense text into the program, the famous computing adage applies: "garbage in, garbage out." To put it plainly, if your text is a random collection of symbols, gibberish, or otherwise, the readability results will not be valid.

https://www.analyzemywriting.com/readability_indices.html

Also:

We first note that our calculation of lexical density assumes that a text is written in English. Furthermore, it is assumed that a text is properly punctuated and apostrophes are used correctly.

Secondly, since we use a computer algorithm to make the distinction between different parts of speech, not every word will be properly classified as lexical or non-lexical. And so far, no computer algorithm can do this task perfectly.\*\ Thus, any online application which computes lexical density can only offer a close approximation. But in most cases the approximation is generally a good one.

We also point out that how to classify certain words can be a point of debate \1\,\3\. For example, an aeroplane takes off. Do we classify "take off" as a verb and a separate preposition, or as a single phrasal verb? What is more, do we count the word he's as two words he is (a pronoun and an auxiliary verb) or a single word? The above illustrates some of the ambiguities which can arise and the resulting assumptions which must be made in order to make a calculation. The software used by this site treats contractions as single words and phrasal verbs as two.

Despite such ambiguities, the reader will see that, for the most part, computers can do a decent job of distinguishing lexical words from non-lexical words, and we again encourage the reader to try our lexical density calculator to better understand how this website calculates lexical density.

https://www.analyzemywriting.com/lexical_density.html

\*\ not perfectly but there are programs that can do the job far better than what AnalyzeMyWriting.com does. And that has been the point of many replies in this thread.

James

331faktorovich
mayo 17, 2022, 1:35 pm

>319 Petroglyph: You have again began your explanation by defining basic concepts like how you tag and count the tags, before skipping over the actual process of how similarity and divergence comparisons are made, skipped giving any of the raw data, and just presented a pretty graph that makes it look as if you have proven the current bylines are correct.

It is unclear why you have calculated the parts-of-speech percentages, unless you are acknowledging that my 7 parts-of-speech tests are easier and better at calculating attributions than the word-frequency measure typically advertised as being used with Stylo. So, now you are saying you have used the 7 parts-of-speech tests I proposed, and yet you are continuing to say that me using this same method is wrong. So you are discouraging users who are not programmers from using my no-programming-needed approach with existing free software; and you are telling them they have to purchase your services so you can run the data through your program (because nobody can use your method who is not a programmer who can create a program of their own or buy your services), and your program gives percentages for parts-of-speech, but no raw comparative data regarding how you decided which texts were similar based on this data. You must have used a rule such as my 18% proximity to a text, and yet you have been complaining this part of my method does not work.

Then, you attempted to combine your method with Analyze and received nonsensical results? Why aren't you sharing the raw data that you came up with from both your CLAWS and Analyze results. What exactly were the percentile differences between them that made such a huge difference in attribution. Give the raw data for 10 texts that you used and I'll duplicate this method to prove you are fraudulently manipulating the graphics, when the actual data is too similar to alter attribution results.

332faktorovich
mayo 17, 2022, 1:42 pm

>322 Keeline: If there were 1,000 people who built their own 3D printers without researching how a 3D printer inventor previously made a model; there would be 1,000 people who have wasted decades of their life to reproduce something that was already invented and in-production. What I have done is the equivalent of designing a system for building your own 3D printer by using a dozen easy steps with tools you have at home and without any special carpentry experience; and I have made this method available for free for everybody to use, and it does not require any of them to hire me to build a printer for them. What you all are proposing is for everybody to be one of those 1,000 lunatics who are all inventing their own printers, only all of them have to either use your programs, or pay you to print their 3D items for them.

333Keeline
mayo 17, 2022, 2:02 pm

>332 faktorovich:

Your replies are a free flow of hyperbole and pejorative phrases. In your ideal of a "scholar" or "academic," is this what you see?

There is a vast difference in the expectations and requirements of one who designs a new attribution method (and claims it is the only one that can tell who really wrote something) and one who uses it.

You have said before, I think in #59, that you don't expect to make your system easy to use. The extensive use of copy-paste in spreadsheets without much or any automation shows this.

There are reasons why fine furniture is not made from chainsaws except on the occasions when that style of rustic appearance is called for.

Every year there are dozens of publications and websites that review the new crop of 3D printers. In each of these there are usually categories for different kinds of users and applications with their own top-10 list. Across all of the review sources there may be, but often is not, a consensus of opinion on what is "best" or "cheapest". Even the reviewing and listing of something like 3D printers is its own cottage industry. It is a bigger field than I think you imagine it to be. Any field has a certain number of "lunatics" but I would submit that the vast majority who are designing, building, and providing this equipment are not.

James

334faktorovich
mayo 17, 2022, 2:05 pm

>328 Keeline: "When it comes to comparing words from one text to others, you need only look at the distinct (like that word better?) words from each text or work." You cannot create each "distinct" word with an equal weight, if some of these words, like "a", occur 1,000 times, and others occur only once. If your method treats these as indistinguishable for attribution, you are giving equal weight for strange words that reflect the subject of the story to the most frequent words that are over-used by a given writer, while they are barely used by others. And in fact this is what you are saying you are erroneously doing when you state you "remove duplicates". Why would you remove duplicates if you are counting their frequency? If you are not sure if you are or are not counting duplicates, your have to state precisely what variant of your method you used before describing a new experiment. If you fail to point out you removed all duplicates, readers are not aware of this glitch in your method.

"I was not trying to create a full authorship attribution system and it is presumptuous for you to think that anyone could present one in the context of a single post reply." And yet that is precisely what I did when I summarized all of the steps in my method to allow anybody to reproduce it earlier in this thread. If your method was reproducible for general users, you would have also been able to fit it into a single post.

The standard method by computational-linguists such as Jackson has been to take out the most-frequent words from consideration, as I explained in an earlier post (when we discussed the 2,000/% mystery). So, if you give extra weight or discount these most-frequent words, or give equal weight to all words is an essential question regarding an attribution method that you have to address when describing what a method is doing. The fact that the top-6 word patterns offer precise author-identification is proven by the data I gathered not only for the Renaissance, but also for the other periods. There are times when these are not exact (as when there are co-writers), and in these times the other 26 tests help to reach the correct attribution conclusion.

The study of "pet words" is extremely misleading. These "pet words" were commonly used in early re-attribution studies that shifted the Renaissance's bylines. It is easy to search texts for strange repeating words and conclude their usage means a shared author. However, it is too easy for an author who reads another author to repeat a word they like, so this method is not quantitatively accurate. And if there are only 6 ghostwriters in a corpus, it is extremely easy to find 2 texts with shared strange words written by a single ghostwriter under two other bylines or with one of these anonymous and to plant the byline on the second text onto the anonymous text, and now you have added an erroneous byline that can lead to further miss-attributions later on. This has happened with anonymous texts that have been attributed to "Marlowe" before more texts were attributed to "Marlowe" by using these previously anonymous texts as proof of shared authorship.

"I can sort that list based on the number of appearances in a text (file) and see which are the most common used." Now you are saying that you are also comparing only the most frequent words? Otherwise how can you compare the "most common" words (if there are 100 or 1000 of them) across all texts if these are going to be broadly different between any 2 texts, and will have a near-infinite number of possible mixes in a group of 284 texts?

335lorax
mayo 17, 2022, 2:09 pm

faktorovich (#324):

You have an extremely slow computer if any computer process takes 6 minutes.

Hahahahahahahahahahaha.

Some of my data preprocessing needs an AWS Sagemaker instance of ml.m5.24xlarge size - this has 96 CPU cores and 384 GiB of memory. (Not disk space. Memory.) There are still some key steps in my pipeline (most notably the pivot from ~100M rows x 20 columns to ~10M rows x 200 columns) that take quite a bit longer than 6 minutes.

You're clearly used to being the expert in the room, but in these threads, no matter what topic you're addressing, there is someone here who knows more than you do.

336Petroglyph
mayo 17, 2022, 2:47 pm

>335 lorax:

A colleague of mine trains neural networks (part of a machine learning approach to categorizing linguistic and cultural types of phylogenetic data). Before they were able to book nodes on the university's cluster computer, they frequently left the processes running overnight, and even for several days. Even with the cluster and its GPU, some processes take hours or more.

When Faktorovich writes "You have an extremely slow computer if any computer process takes 6 minutes", she sounds like someone whose longest wait is for Word to complete a "find-and-replace-all spaces" type process.

337lorax
mayo 17, 2022, 3:01 pm

Oh look, another "extremely slow computer":

https://www.ligo.org/science/faq.php#computer-clusters

"These computers are typically organized in large computer clusters, operated at several sites within the LIGO Scientific Collaboration. On these clusters, the computations can be done within days or weeks, rather than many years."

338Petroglyph
Editado: mayo 17, 2022, 4:07 pm

>331 faktorovich:

before skipping over the actual process of how similarity and divergence comparisons are made

That is a lie. I've pointed you to this formula before:

These are the calculations that are done. What more do you want?

Is it that you don't understand this formula? Would you like me to explain it to you? I will! Just say so!

skipped giving any of the raw data

If you want the data, ask. Is that so hard? Why move straight to accusations? What makes you think I owe it to the LT audience at large to provide more data? This is a post on the goddamn internet. Not a published paper. Stop confusing the two.

You're the one who's eagerly handing out review copies and Github links to Wikipedia editors in hopes that they will pay attention to you and tell you you're right. I'm not like that.

Besides, back in the Wizard-of-Oz-era of this thread, I provided all of those things to you. And you still lied and claimed I hadn't shared my data. For several other Lunch Break Experiments (tm), I have provided both the corpus and the code.

In other words: this "yOU dOn'T prOViDe THe dATa" is just one of your lines, some accusation you spew regardless of whether it's true.

Or perhaps you don't understand the distinction between data and analysis. Do you mean that you want me to share all the calculations that Stylo does under the hood? Cause that's analysis, not data.

It is unclear why you have calculated the parts-of-speech percentages

It's not unclear at all. In that post, I wrote: "{o}f course, if you don't want to do anything with these POS and all you want to know is how many there are of each, it's easy to call up a list."

So that's why: if all you need is the percentages of the POS, then have R create that table for you. It's easy. I included that section to show that a) it's possible, and b) it's easy. Isn't that usually exactly what you want?

Careless reading just so you can hurl some criticism -- any criticism. This makes me think you didn't actually read that post.

UnlEss yOU aRe aCkNOwleDginG ThAt my 7 PaRTS-of-SPeeCh TEStS ArE EAsieR ANd BEtTeR AT cAlCulATing ATTrIbuTiONS thAn tHe WORd-FrEQueNcy mEaSure TYpIcALly aDvERtIsed AS BeiNg usEd WIth StYlO

No. But I can sense your desperation to be proven right. Your tests are crap, by the way: based on a shitty look-up table, poorly implemented and mis-used.

because nobody can use your method who is not a programmer who can create a program of their own or buy your services

You have mistaken me for a programmer; I am not. And everything I used in that post is free, and had already been provided by actual programmers for me to use. No purchases were involved in the making of that post. Other people who are not programmers either use the same methods.

Your comfort zone is not the measure of science, Faktorovich. Or the measure of Lunch Break Experiments (tm).

Why aren't you sharing the raw data that you came up with from both your CLAWS and Analyze results

Again, why is that the expectation? Why is that something to be angry at? Why don't you just ask me for my data?

Also, if you had actually read that post, you'd have known I didn't use CLAWS. Didn't even mention it. And you'd know that I approximated AMW's results, because AMW does not export a tagged text.

This forces me to conclude you did not really read that post. Perhaps you glanced at it. But only enough to spew your stock accusations. There's no comprehension behind your accusations. Just automated insults.

Give the raw data for 10 texts that you used and I'll duplicate this method to prove you are fraudulently manipulating the graphics, when the actual data is too similar to alter attribution results.

Lol. Why only ten? Is it because ten texts will take you like a whole day whereas ~30 texts took me about an hour? But alright, since you asked: here is a .7z file with the original 29 novels, from Project Gutenberg, and quickly cleaned. (If dropbox gives you a popup to sign in, just X it and the file should be downloadable.) Those were my "raw data", as you requested. If you need some of my processed data, well... You know what to do.

This accusation of yours, that I manipulate my results, goes back to the Oz era of this thread. Your tune has not changed. Back then I provided you with all the tables, the wordlist, the stylo settings, and the corpus. Those files are still up, and the links I gave you then are still live. If you're really, actually, genuinely serious about proving me wrong, you can use those.

I know you won't. And so do you.

You lie about me, Faktorovich. You lie about other scholars. It's just a tactic you use to make yourself look better by comparison. But your lies are still lies. And, Anna Faktorovich, you can't lie to and about people and accuse them of "fraudulently manipulating" their results, and still expect them to treat you with respect. Your lies to and about people have consequences for the terms under which they are willing to engage with you.

You're a hack, Faktorovich. You're LARPing as a scholar with tattered Spiderman bedsheets for robes. You want to sound like a true scholar, but you have zero interest in or zero understanding of actually engaging with your material or with other people. You like the adoration that you feel should come with your pretentions of discovery, but you're unwilling to put in the work. You're incapable of backing up your accusations of data manipulation with the kind of skills needed to properly vet my methods. You simply lack the capacity to do so.

339andyl
Editado: mayo 17, 2022, 4:40 pm

>326 faktorovich:

Actually I am aphantasic. I don't visualise.

Nevertheless you don't actually have to visualise anything to do the maths*. As I said you are just manipulating sets.

* In fact some mathematicians feel that it is counter productive to visualise higher dimensions when you are working in them.

If you truly believe your manual method is far quicker and more reliable than a piece of software then I may have a bridge you may be interested in.

340faktorovich
mayo 17, 2022, 8:40 pm

>335 lorax: You just said your computer is bigger than mine: this doesn't take knowing more than me, it just takes wanting a bigger computer. You have something wrong with your program if it requires over 6 minutes to process with a computer that has nearly twice more CPU cores than all standard-issue computers. A program should be elegant and should not give your computer a headache.

341faktorovich
mayo 17, 2022, 8:43 pm

>336 Petroglyph: I have waited for a program to process stuff for a while before, but if my computer had 96 CPU cores, it would have taken moments. And it is absurd that any computer programmer would take pride in creating a program that takes over-night to finish processing. It probably means there are useless lines of code, or repetitive steps that have made the task far more complicating than it should be.

342prosfilaes
mayo 17, 2022, 8:46 pm

>324 faktorovich: You have an extremely slow computer if any computer process takes 6 minutes.

That is an exceedingly bizarre statement.

You are still not verbalizing what this program does.

I wrote it out in clear Python. If you want it in English, it turns the documents into vectors of the number of words in a document; it then normalizes the vectors, dividing all the counts by the same number so the length of the vector is one. It then takes the dot product of the vectors; if two documents have the exact same word count, their dot product is one, and if they have no words in common, their dot product is zero.

If you do not know what a program you are pitching does,

I'm not pitching anything, and I know exactly how it works.

He also could not explain just how "raw distances" between 100 or 1000 different words in hundreds of texts can possibly establish a non-chaotic attribution conclusion.

Or perhaps you weren't willing to accept his explanation, and he didn't feel like continuing. Which I understand.

>326 faktorovich: If you are working on a math problem that has 3D data and you are not visualizing what the calculations you are attempting to apply is actually doing; you have a "weird way" of ignoring mathematical reality. If a solution you have come up with would look incorrect to a kindergartener, you are not on the right track. The correct solution is one you can explain to a child visually. If even a child looks at a ball of a million strings between 284 texts that represents matches at random frequencies and asks "what is that supposed to do?" you should re-think your method.

World-class neural networks are basically opaque; how they work is a mystery to even the programmers that set them up and trained them. And they are the best solution for many AI problems, like the first Go program to beat top-level Go players (AlphaGo) and the translation code behind Google Translate. Life is not so neat that every solution can be explained to a child.

343faktorovich
mayo 17, 2022, 8:48 pm

>337 lorax: Even with this humongous computer-contraption folks are still calling it "Dark Matter" because nobody has been able to see or to otherwise prove it exists. Alternatively, scientists can just be 85% off on the volume of matter in the universe, and they are hiding errors in their equations by claiming "Dark Matter" is hiding the rest of the matter. It is indeed mesmerizing to need a thousand computers to be processing non-stop to fail to find something.

344faktorovich
mayo 17, 2022, 9:13 pm

>338 Petroglyph: Yes, this formula or those like it appear in these articles, but it is not rationally explained by any of these researchers. Can you verbally explain what is being compared to what and how. I previously explained the problems with actually applying this formula in practice; you never replied to the questions I raised.

---The formula that you do list is absolutely absurd for the following reasons. 1. By the average z-score you mean the frequency of a given word in a text. There are thousands of unique words in most texts, so how can all this data be simplified into a single average? Are you adding up all frequencies for all words and calculating the average frequency between them? This would just be nonsensical for purposes of attribution. There is no weight given to any words in this formula. To clarify your formula: k = text; author = a; n = words; z = frequency score. You are saying that by manipulating the name of the text, the author, the number of words in the text and the frequency score for all words on average, you get a single point of similarity-divergence on a graph of this texts similarity. The aren't even an author 2 or text 2 in this formula. This is the only formula this system uses to derive your attribution answers? Can you try giving an example of how this formula was applied for a specific comparison? It just seems to be nonsensical double-speak.---

I have asked for an explanation and the data, and instead you are repeating the question.

I am not forcing you to publish a paper for free on the internet in this thread. You are objecting to my method, and insisting that your method is the only correct method of attribution. So, yes, provide the data and the answers, or stop complaining about my clearly explained method, and fully provided raw data.

You have not provided the raw-data or the frequencies for all words tested, and all text-versus-text comparison scores, and how these led to the attribution conclusions in your pretty picture diagrams. And yes, I mean the calculations that Stylo does before spewing out the final attribution diagram. The conclusion is in no way equivalent to raw data. Are you saying you cannot go under the hood and find the raw data in Stylo? I provide all stages of my computations in my data: the raw frequencies/data gathered from different programs; the proximity 1/0 results for all text-to-text comparisons; the total number of matches between all compared texts; and a diagram that organizes this data into authorial groups. And I also provide several different diagrams that visualize the data. Instead, you just provide the final visualizations, and what happens to get there is a black box. If you are refusing to show what is in your box, then you cannot argue that whatever is in it is superior to my fully exposed box.

You did not include any data in the file you linked to. You just included the texts you found on Project Gutenberg; at least one of them has not had its Project Gutenberg legal terms-of-service deleted at the end of the file. If you think the texts are "data", and you need to be asked to provide data that resulted from you doing any processing or calculations; then, it is likely that you have not been doing any processing at all, and thus all you have is the texts before anything is done to them.

A list of numbers next to all of the words in a text do not explain what actual formula you used to arrive in them, or how these individual numbers were combined or compared to arrive at the attribution decision. How does the frequency for each word, turn into a percentage/rate of similarity between the two texts. The formula you gave is not designed to compare any two texts, but rather to rest a single text.

345faktorovich
mayo 17, 2022, 9:16 pm

>339 andyl: Mathematicians who claim to work in dimensions beyond the 4 proven dimensions are making up fiction; so if they cannot even imaging what they are talking about; they are really just magicians who have tricked people into buying their unimaginable science fiction.

346faktorovich
mayo 17, 2022, 9:36 pm

>342 prosfilaes: "It turns the documents into vectors of the number of words in a document; it then normalizes the vectors, dividing all the counts by the same number so the length of the vector is one. It then takes the dot product of the vectors; if two documents have the exact same word count, their dot product is one, and if they have no words in common, their dot product is zero." Here is the method as you are describing it. Step 1: Count number of words in each text. Step 2: Normalize the number-of-occurrences of each word in each text so it reflects the words-in-that-text. Step 3: Compare all words for all texts by either marking text-vs-text for each word as having the same word (1) or not sharing that word (0). Again, this would generate a 0/1 table that would have billions of 0's and 1's; and without having a system for what indicates a match, it would be entirely chaotic and entirely unproductive for figuring out who wrote all of the texts in the corpus or which texts are similar or different from each other. Your approach, as you describe it, also means you are not considering how frequent a word appears, but instead if it appears in the text being compared against; thus the most frequent words like "a" have the same weight as the most rare words that appear only once in a single text, and skew the results to a 0 for all other texts in which this word does not appear. Your method does not work as-described to provide any usable data for attribution.

I just watched a news report about a banker who convinced investors/borrowers that he was using artificial intelligence to "revolutionize" banking and this generated billions in capital, most of which he miss-invested and nearly collapsed the company. AI or "neural networks" are designed to be "opaque" because they are black-boxes where absolutely nothing can actually be done, while the "programmer" draws a pretty diagram with the conclusion that is in his own or his client's self-interest. If a product or service you are selling cannot be audited, or an auditor cannot look inside the system to determine how the solution was arrived at; then, this system should not be used to change the bylines of classic texts, or to shift billions in invested capital. Google Translate is not run by "AI", but rather with a checkable program and dictionaries. If a chess program is fed the rules of chess, or attempts every possible move before deriving the 1/0 or good/bad choices to make in different instances; it is using checkable math, and not mystical calculations that are uncheckable.

347prosfilaes
mayo 17, 2022, 10:22 pm

>346 faktorovich: So, here's the thing; I gave you a clear English description and then you ignored it, in favor of what you thought it should do.

Your claims about AI are ... amazing. You could work through something like Get Programming and then Deep Learning and the Game of Go, and then you'll have first hand awareness of how amazingly puny the computer is when trying to look at every possible chess game, all 10¹²⁰ of them. Or you could continue demanding that people read all 15 of your books and ignore any obligation to know something about other subjects before speaking on them.

348paradoxosalpha
mayo 18, 2022, 2:08 am

>345 faktorovich:

Dimension is a mathematical notion in no way limited by space-time, which is itself of dubious actuality. You have a remarkable and unbecoming propensity to dismiss as "fiction" ideas that you've obviously made no effort to understand.

349anglemark
mayo 18, 2022, 2:59 am

>329 faktorovich:

And it would have to be some kind of a hacked glitch for a program to categorize "you" as an "auxiliary verb" unless it is misspelled as a version that is not in the program's dictionary; but "you" spelled as-is is not a homonym, so it cannot be categorized as anything other than a pronoun.

Yet again, you glance quickly at a post and respond to what you think it says.

"explain'd" would be miss-categorized by CLAWS because this system interprets all 'd instances as if they are I'd or a verb such as "would". This glitch does not really come up in Analyze, which does not categorize any cases of 'd as a verb.

The error rate of CLAWS in EModE texts was the topic of the very first post discussing that tagger.

Your claim about AMW is fractally wrong in new unexpected ways. Well done.

-Linnéa

350Keeline
mayo 18, 2022, 10:55 am

>341 faktorovich:

I have waited for a program to process stuff for a while before, but if my computer had 96 CPU cores, it would have taken moments. And it is absurd that any computer programmer would take pride in creating a program that takes over-night to finish processing. It probably means there are useless lines of code, or repetitive steps that have made the task far more complicating than it should be.

This expresses an unreasonable certainty for one who seems to find the concept of a "loop" of a computer code intimidating. Have you ever written and optimized a computer program of any level of complexity?

This computer has 6 cores but I am often waiting with much of it unused. For example, major software packages like Adobe Acrobat only are configured to use one thread and one processor core. This is most evident when performing OCR on a PDF. The processor load meter shows that it is barely tasking the hardware and the process can take a good deal of time — certainly much longer than your 6 minute benchmark. Even though the first computer I had with a 2 core processor was purchased more than 20 years ago, Adobe Acrobat "Pro" has not been written to take advantage of available hardware. The computer is more than 80% idle during this process (if nothing else is running).

One might wonder if this is just the nature of OCR that it is a very sequential task that cannot benefit from parallel processing to use multiple threads and/or multiple cores. Adobe collects a lot of money each year from the paid users of Acrobat Pro. Some get it individually but most get it as part of an annual license for many Adobe programs. But any way you look at it, there is funding available for development or significant refactoring of the code to reflect the type of computers available for 1/5 of a century.

Can no one do it better than Adobe with all of their developers and money to pay them? Yes, as it turns out. One such example I found was a command-line Python script called OCRmyPDF which uses the Tessaract OCR engine. When this program is running, you know it. The evidence of the computer working hard is indicated by the greater heat produced, turning on the exhaust fans, and the processor load indicators. It also runs faster for the whole job.

The accuracy of any OCR is always a bit less than ideal. Factors include how clean the page images are and their overall resolution. Excess noise will be interpreted as other characters or punctuation and require a lot of proofreading if one is going to do more than a keyword search (my most common use case).

For work I wrote a program late last year to export some of our key data for an outside analyst. To put it in a form he could use, each of the more than 1.5 million rows had to be pulled from several database tables with nonstandard JSON coding and process them for the export. If this was run as a single process, it takes about 8 hours. So if there is a mistake, a lot of time can be wasted. But I found that I could use a tool called, appropriately, "parallel" to run 8 processes at a time in parallel since that was the number of cores on the cloud server available for the task. This involved dividing up the larger task into smaller ones and planning so each job could be done independent of the others and later combined. The new run time for the whole was based on the run of one of the slower sub-tasks. It still took hours for the whole. But when I did an incremental update recently for about half a million rows, the parallel project took about 45 minutes where it would have taken about 8 times that as a sequential process.

To people aware of computer architecture or system administration, this is a normal part of doing business. It is unusual and requires a little bit of planning but it is not revolutionary. But for someone who only double clicks on an application icon and clicks buttons inside it, this is probably not part of a consideration.

This parallel tool is something I use several times a week when I want to convert images from one format to another. To do it sequentially is a slow process. But in parallel it is about 5-6 times faster. This is an example of code to do the conversion sequentially using the ImageMagick utility called "convert". Warning: it uses a for-loop.

for f in *.tif; do echo "Converting $f"; convert "$f" "$(basename "$f" .tif).png"; done

And this is the command to invoke it with parallel:

parallel --bar convert {} {.}.png ::: *.tif

And this is just using one computer. Big jobs usually involve sending off smaller tasks to other computers on a network. This can be called a "cluster" and there are several approaches, including a Beowulf cluster. I taught courses on this about 20 years ago.

If there is a reason to do lots of comparisons or counts, there are ways to achieve it. In a huge multi-body simulation such as a weather or "dark matter" experiment, you can be sure that they want to get the results as fast as possible while preserving the integrity of the data. So they make use of the tools and resources. If a task takes hours or days or more, there is a good reason for it and to claim that there are:

probably means there are useless lines of code, or repetitive steps that have made the task far more complicating than it should be.

is naïve in the extreme.

James

351Keeline
mayo 18, 2022, 12:15 pm

>343 faktorovich:

This is not a suitable forum for discussing "dark matter." This comment expresses an understanding of what is referred to from a very slight perspective.

Our ability to "see" something in the universe involves the transmission of radiation (light, infrared, X-rays, radio waves, etc.) over long distances. So we generally "see" luminous bodies like stars with a fusion reaction. Sometimes we see that light is bent by a very massive object (gravity lensing). Other times the light produced is blocked as it passes behind an object.

But not everything emits radiation that can travel the long distances to be "seen" by us and our instruments.

As models are built of the other galaxies which we "see" with instruments like the Hubble Space Telescope, there is a discrepancy in how they move and interact with other bodies that can't be completely accounted for by the mass of the luminous (light-producing) objects. That is the reason for saying that non-luminous matter may be present. We can't see it but we can see the effects of it on the other objects.

It is like having a double star where one of them is very small and massive. The brighter star that can be seen wiggles a bit in the careful measurements. As models are built up, it can be described as two objects revolving around the center of mass of the pair. In time, if the conditions and instruments are right, the second object can be "seen."

I think about the earliest attempts to actually quantify the speed of light and what it takes to measure something that is so fast. The idea of being able to make a reasonable estimate in the early 1700s is rather extraordinary. It involved noting the time it took for some of Jupiter's moon's to transit (go in front of) or be eclipsed by the planet and noticing how the times were different when the relative positions of the Earth and Jupiter changed. Of course other methods were devised over time and the value has been refined to a high degree.

But if you think that all mathematical concepts, scientific experiments, or complex systems should be something that can be described to and understood by a kindergarten-age child, your expectations are of a world that is far simpler than it really is.

Speaking of which, what do kindergarteners think of your attribution system?

James

352faktorovich
mayo 18, 2022, 2:38 pm

>347 prosfilaes: Yes, most books about chess statistics cite this "Shannon number", but it does not take into account pieces being consumed as the game progresses, or the fact that the pawns get stuck and cannot move forward after meeting those from the other side. And it adds up white and black pieces' choices separately, whereas the computer would only be calculating the possibilities from one of these perspectives to win. And the computer would be calculating risk and best-options in each move separately and not for the entire game in advance. And if a game ends in 10 moves because the computer picks the ideal moves in those 10 moves, then it does not have to perform astronomic quantities of calculations, as the numbers increase with each new move, but for each individual move there is a similar shrinking number of calculations (if it counts no more than 10 moves in advance). There is no advantage in any of this for a machine with artificial intelligence, as it instead needs a very precise understanding of the rules, and to imagine the possible future moves and if the probability they would lead to a positive or a negative outcome. If you are attributing the win to an artificial intelligence, you are saying you have not created the necessary statistical etc. programming, and the victory was mostly just random or "luck".

353faktorovich
mayo 18, 2022, 2:44 pm

>348 paradoxosalpha: I wrote a 200,000-word scientific fiction novel about what it would realistically be like for a species to travel to find a new habitable world and for it I return to my study of astronomy back in college. I would not have made any statement about the anti-scientific nature of dimensions beyond the 4 proven dimensions, if I had not given this subject previous extensive thought.

354faktorovich
mayo 18, 2022, 3:03 pm

>350 Keeline: If you look back through this thread, you will notice that when I tested Stylo, I created a little computer programmer to test the method. I have previously reviewed and studied some computer programming textbooks and tested a few programs. I also frequently use HTML. Computer programming is not a mysterious unpassable subject; it just involves using standard phrases to achieve precise outcomes.

There is a difference between a program that achieves its stated purpose after taking a while to process data, and a program that never finds "dark matter" or anything else it is supposed to be searching for/calculating after processing for years or decades with an enormous array of computers (or cores). There are ways to zip a large file, or to minimize the time a program takes, as you have said. And yes, in many cases Adobe and other software companies do not take these minimizing steps because they have near-monopolies on the market, as anybody using something other than Adobe's suites in design is laughed out of the room. If people keep buying your product even is you make only tiny updates annually, you can keep the industry decades behind. Whereas if we had an open software market where individual programmers could take better products to-market independently (perhaps with help from a cheap or free platform for such exchange), we might all be picking between a dozen new increasing cheaper and far more advanced design programs annually. So, my statement about "useless" code and "repetitive steps" was not "naïve", but rather matches exactly what you have restated in your post.

355paradoxosalpha
Editado: mayo 18, 2022, 4:29 pm

>353 faktorovich:

Your "study of astronomy back in college" and having written "a science fiction novel" are neither of them evidence of labored consideration of advanced mathematics.

The spacetime of "the 4 proven dimensions" is a naive and misinformed reference.

"Many, many separate arguments, all very strong individually, suggest that the very notion of spacetime is not a fundamental one. Spacetime is doomed. There is no such thing as spacetime fundamentally in the actual underlying description of the laws of physics." --Nima Arkani-Hamed, Cornell Messenger Lecture 2016

"I am almost certain that space and time are illusions. These are primitive notions that will be replaced by something more sophisticated." --Nathan Seiberg, quoted in a 1999 Los Angeles Times article

356faktorovich
mayo 18, 2022, 3:22 pm

>351 Keeline: The instruments currently used in astronomy are barely starting to detect some planets in solar systems outside of our own. If objects are not behaving as-expected in distant galaxies, the simple answer is that we our instruments are insufficient to detect all objects in those parts of space (perhaps because some are directly behind others), or the math might be incorrect as we are miscalculating the distance of these objects, and thus their mass is also miscalculated. Instead of figuring out what glitches are leading to anomalous readings, astronomers have chosen to lump all unexplained matter into categories they just cannot see or calculate. Britannica states: "Dark matter makes up 30.1 percent of the matter-energy composition of the universe; the rest is dark energy (69.4 percent) and “ordinary” visible matter (0.5 percent)." In other words, scientists are claiming they have only found .5% of the universe, and the other 99.5% is stuff they have not yet proven has any matter or measurable energy readings. Out of just the matter, visible matter is then only 1.6%, while "dark matter" takes up the other 98.4%. This same data can also mean that astronomers have discovered their calculations about distant stars is over 90% wrong, but with "dark matter" and "dark energy" as their invisible explanation, they can continue running the same types of tests, without any new scientist needing to come in to figure out where they are wrong on their math.

The scientist who first calculated the speed of light had institutional/ funding support. In contrast, a scientist who attempts to disprove the existence of "dark" matter/energy will not find any funding or support because if he succeeds in finding a rational explanation for the previous mathematical miss-calculations he or she would be proving that astronomers who have been supporting these theories for decades have been selling science-fiction.

You are right, a kindergartener would not be able to spot science fiction like the concept of "dark matter" from the reality of humans' preferment for career-advancement over finding scientific truths.

357faktorovich
mayo 18, 2022, 3:25 pm

>355 paradoxosalpha: So you are saying that the attribution method all of you guys have been supporting is similar to the study of dimensions outside of the 4 proven dimensions? It is as likely that your attribution method is correct, as if "space and time are illusions"? I feel still more re-assured that you guys are proving that my side of this debate is the rational perspective, and yours is the extremist fiction.

358lorax
mayo 18, 2022, 3:49 pm

faktorovich (#353):

Remember how I said that for any topic there is someone here who knows more than you do?

My PhD is in astrophysics. But go ahead, tell me how the idea of dark matter is absurd.

359Keeline
mayo 18, 2022, 4:05 pm

>356 faktorovich:

The instruments currently used in astronomy are barely starting to detect some planets in solar systems outside of our own.

And how are they doing this? It is not that they can see them with visible light. Instead it is by noting the ever so slight changes in position as the visible object (a star) is influenced by the other ones. This is why the extra-solar planets discovered are often very large.

Only one potential (i.e. unconfirmed) exoplanet has been postulated in another galaxy. But of the ones confirmed in this galaxy there are more than 5,000 that have been claimed with a wide range of sizes and distances from their stars. The Wikipedia page tries to summarize this.

https://en.wikipedia.org/wiki/Exoplanet

Note the section on "direct imaging".

Trying to detect an extragalactic planet is the new thing to do for astronomers. Confirming it by any means is another matter considering the distances involved.

https://en.wikipedia.org/wiki/Extragalactic_planet

James

360lorax
mayo 18, 2022, 4:12 pm

I'm not sure why I bother, but for the amusement and perhaps edification of others, some thoughts about dimensions.

For most day-to-day purposes, we can treat the universe as though it has four dimensions - three of space, and one of time. One of the most popular theories explaining the properties of subatomic particles (especially why gravity is so much weaker than the other fundamental forces), however, posits that there are ten dimensions - the other seven are "compactified" in such a way that they aren't detectable in most ways. (Disclaimer: as I said, my PhD is in astrophysics, not particle physics, and I never got beyond regular old quantum mechanics - no QED or QCD here. So my knowledge here is barely above "educated layperson". I invite correction and clarification from real particle physicists.) Detecting these additional dimensions via their effect on extremely massive subatomic particles is an area of active research, but requires massive particle colliders.

However, regardless of the number of dimensions of ordinary physical space, people doing data science work with many dimensions every single day, to the extent that a common problem encountered by people doing things like finding the distance between documents based on word frequency - sound familiar? - is known as the "curse of dimensionality". If you want to find a "distance" between two points, you need a space (which can have an arbitrary number of dimensions, even one - any kid can measure the distance between two points on a line). Correct definition of this space is important - the distance between two antipodal points on the Earth is very different if you're operating in the 2-dimensional space of the Earth's surface or the three-dimensional space that measures the difference through the Earth itself.

To choose a topic less charged than document similarity, let's build a very simple model to figure out how similar two libraries on LT are, based on the books they contain. A naive approach would be to build a massive user/item matrix - that is, for every user, record with a 1 or a 0 whether they have a particular book. You can then imagine each user existing somewhere in bookspace, with a mind-boggling 27 million dimensions (one for each work on LT), sitting at zero units on a particular axis if they have the book, and 1 unit if they do. Then it's a straightforward vector comparison to find the distance between two users.

This is where the "curse of dimensionality" comes in. Those 27-million-dimension vectors are very, very sparse. Most users have only a tiny fraction of all the possible books on LT. So to find the similarity between users, smart data scientists will use tricks to reduce that dimensionality. Maybe they'll ignore all single-copy works. Maybe they'll use metadata like genre or year of publication instead to construct their space - this user lies far out on the "science fiction" axis, with lots of SF books, but near zero on the "theology" axis, and far on the "21st century" publication axis as well. Feature engineering is an art. But any data scientist worthy of the name deals in many more than three or four dimensions every single day. Remember that matrix manipulation I mentioned earlier (the one faktorovich sneered at?) That's a dimensional transform in the other direction, going from about 20 dimensions to about 200. I don't end up using all of those - more like 40 in the final model - but if I refused to use more than 4 features in a model because we live in a 4 dimensional universe I'd rapidly be out of a job.

361paradoxosalpha
Editado: mayo 18, 2022, 11:47 pm

>356 faktorovich: The instruments currently used in astronomy are barely starting to detect some planets in solar systems outside of our own.

You are not current with astronomy because you once took an undergraduate course. "To date, more than 4,000 exoplanets have been discovered and are considered 'confirmed.' However, there are thousands of other "candidate" exoplanet detections that require further observations in order to say for sure whether or not the exoplanet is real."
https://exoplanets.nasa.gov/faq/6/how-many-exoplanets-are-there/

>357 faktorovich: So you are saying that the attribution method all of you guys have been supporting is similar to the study of dimensions outside of the 4 proven dimensions?

What I am saying is that your claims deriding calculus of n dimensions and asserting the veridicality of "the four proven dimensions" are uninformed and bordering on fatuous.

You lump me in with the "you guys" of your paranoid ideation, as if everyone who disagrees with you has a single shared perspective. I don't subscribe to any codified "attribution method" for literature--certainly not yours, as I assess it through your defense of it in these threads.

362Keeline
mayo 18, 2022, 4:24 pm

>345 faktorovich:

Mathematicians who claim to work in dimensions beyond the 4 proven dimensions are making up fiction; so if they cannot even imaging what they are talking about; they are really just magicians who have tricked people into buying their unimaginable science fiction.

Is this from the same person whose authorship tests involves no fewer than 27 measurements (counts reduced to binary flags) of a writer's text?

The science fiction portrayals of extra dimensions and parallel worlds and the like is great for dramatic fiction but it is not what the scientists and mathematicians who work on things like String Theory are talking about. I don't understand the details of it but I can recognize at least this level of understanding on the topic. I only took five years of physics in university and that was many years ago.

James

363Petroglyph
Editado: mayo 20, 2022, 8:23 pm

>344 faktorovich:

You asked for the "raw data". So I gave you those: the raw, unedited input, for which the only preprocessing was getting rid of the PG legalese and title pages etc. (I missed a few paragraphs at the end of a Woolf novel -- good that you pointed that out. I was going to use that for another Lunch Break Experiment (tm), and I've now fixed that oversight.)

I did not give you any of my analyses -- the separate stages of processing the data to insert POS, and then extracting only the POS. Because you did not ask for them. You said you wanted the data. So that is what you got.

Mean what you say, Faktorovich, and say what you mean. Your confusion between "data" and "analysis" is not something you can put on me.

The formula you gave is not designed to compare any two texts, but rather to rest a single text

The text above that formula explicitly states it's about comparing a text to an authorial signature generated from a body of texts. "Author subset a" means "the scores generated from a body of texts belonging to the same author that you're comparing text k to."

You can repeat this formula as many times as you want, for as many texts as you have. It's the same formula for comparing two texts, or five. You repeat this for every pair of texts (or groups of texts by the same author) that you care to apply it to.

Are you adding up all frequencies for all words and calculating the average frequency between them?

No, I'm not. That would be stupid.

This would just be nonsensical for purposes of attribution.

It would be. Fortunately, that's not what this formula means. And now you can stop making this silly accusation.

The aren't even an author 2 or text 2 in this formula.

You can repeat this formula for however many texts or authors there are in your corpus. And the a figure explicitly contains a bunch of a single author's texts. You do not understand what you are looking at.

There are thousands of unique words in most texts, so how can all this data be simplified into a single average?

... You, with your 0/1 techique, are seriously wondering how a single figure can summarize an entire sequence of steps? Is this a "I'm gonna sound incredulous just because it makes that formula sound nonsensical" or are you genuinely unable to read this?

How are you going to critique the analyses (not "data") if I were to give them to you? You don't understand the steps taken here?

You are objecting to my method,

I am.

and insisting that your method is the only correct method of attribution

That's you putting words in my mouth. I've never said that. This is your extremist either/or way of thinking; this is not how other people think.

If you think the texts are "data",

They are. What else? Mean what you say, Faktorovich, and say what you mean. Your confusion between "data" and "analysis" is not something you can put on me.

If you think the texts are "data", and you need to be asked to provide data that resulted from you doing any processing or calculations; then, it is likely that you have not been doing any processing at all, and thus all you have is the texts before anything is done to them.

Those steps are part of the analyses. You did not ask for the analyses.

Let's try a Faktorovian reply to this: Your GitHub repo only contains the tables where you claim to have processed texts and the results you get from Analyzemywriting and other sites. But the actual texts are not there. ~~Nor are the actual percentages from AMW and the others -- only what you did to those results~~.

From this, it follows that I can say something like "It is likely that you have not been doing any processing at all, and thus all you have is your manipulated tables. Unless I can see your actual texts, and the actual results from AMW and wordsmith and Online Utility and LIWC for every single one of those texts, you've left out steps of your analysis and I cannot trust your results. Post all of that, or I'll just call you a liar and a data manipulator."

See? I've just
a) explained how your results are not worth even looking at
b) dismissed them out of hand, and
c) it cost me no effort, other than claim "this is suspicious".

I can see why you like this tactic: it allows you to skip all the hard work for no effort: lazy accusations and shit-smearing. And if you protest, I can just claim I've already "explained" the flaws in your methods and act all exasperated that you're not understanding me.

If academia worked the way you think it does, if society worked the way you think it does, there'd be no academia, and there'd be no society.

364Petroglyph
Editado: mayo 18, 2022, 7:32 pm

But we've been here before: Faktorovich is asking for every single detail of the analyses, because, so she implies, she is personally going to vet and audit those steps. And unless she can do so personally, she's simply going to declare "dAtA mAnIpULAtiON" and be done with it.

The thing is, she'll do that anyway.

See, the last few times we've been through this rigamarole (i.e. Wizard of Oz, Austen-Brontes-Corelli) I gave her more data and some processed data, too (e.g. the config file, the frequency table, the word list, the code). And she straightaway jumped to misunderstandings of Z-scores and accusations of manipulating the tables I gave her specifically so that if she checked them they'd confirm my original results, which were also, so she stated, manipulated.

In other words, even if I jump through her hoop, she's simply going to glance at figures beyond her comprehension, jump to a misunderstanding (e.g. "5000 x 5000 x #-of-texts" is how Delta works; or "Mathematicians who claim to work in dimensions beyond the 4 proven dimensions are making up fiction") and shout "dAtA mAnIPuLaTiON".

I know this because she's done this before, and any excuse will do. For instance:

Msg #333: This is the definition of bias, the 5,000 words your "Lunch" experiment tested for was not a randomly-selected list, or a list of the top most-frequent words in all of the texts in the corpus, but rather a list you manipulated with "edits, deletions" etc.

Msg #346 Comparing such a large quantity of words against each other would create a 5000X5000X#of-texts dimensional mathematical space unless there is a trick that simplifies all this or evaluates something simple - or your formula that obviously does not fit your description of your methodology.

Msg #346 "Your failure to even explain the formula you said you are using is just one example of how you are not making a full disclosure here of your method, and if you did, even more falsehoods would become visible"

"Doctor" Faktorovich, you will literally point to a formula of Z-scores, (i.e. relative frequencies and means and standard deviations) and claim that, because I did not explain these basic statistical concepts in detail to you in that very post, the whole method is suspicious and I'm hiding my underhanded data dealings.

Conclusion: Any part of the process that I did not carefully explain to you is an excuse for you to lie about data manipulation. Any part you happen to misunderstand becomes another accusation.

That's not how people other than you engage in discussion, "doctor" Faktorovich. It's dishonest. It's bad faith. And people who systematically engage in bad faith discussions are not worthy of my time.

You've already reached the conclusion that not-your-method is "fraudulent" (as per >331 faktorovich:), and no amount of hoop-jumping is going to prevent you from making a lazy, knee-jerk accusation of DaTA mAniPUlaTiON.

365Keeline
mayo 18, 2022, 6:03 pm

>346 faktorovich:

AI or "neural networks" are designed to be "opaque" because they are black-boxes where absolutely nothing can actually be done, while the "programmer" draws a pretty diagram with the conclusion that is in his own or his client's self-interest. If a product or service you are selling cannot be audited, or an auditor cannot look inside the system to determine how the solution was arrived at; then, this system should not be used to change the bylines of classic texts, or to shift billions in invested capital.

You seem to like black box systems since you use AnalyzeMyWriting.com and WordSmith.com where there is no access to their source code and only what they say about their methods, product, and caveats. Why complain about this now?

Google Translate is not run by "AI", but rather with a checkable program and dictionaries.

Where may I check this program for Google Translate? I have not seen that the code is "open source" for consultation.

I see many sites that show how to use the Google Translate API and programs that use them. But I don't see that Google has shared its translation and interface system with the rest of the world like open source projects do (and this includes Stylo and R but not CLAWS which is a commercial product but free to use within some criteria).

James

366prosfilaes
mayo 18, 2022, 7:29 pm

>352 faktorovich: Yes, most books about chess statistics cite this "Shannon number", but it does not take into account pieces being consumed as the game progresses,

There's 400 possible initial pairs of moves, but the calculation is based on only 100 possible moves in a pair of moves. Also, as pieces are consumed, pieces that can move across the board can move across the board. Again, you didn't research anything before dismissing it.

And if a game ends in 10 moves because the computer picks the ideal moves in those 10 moves, then it does not have to perform astronomic quantities of calculations

100 ^ 10 is 100 quadrillion. If a computer can grind out 10 billion positions a second (which seems a bit high for a standard desktop computer), that will take 3 centuries to calculate. That's if the game ends in ten moves; Deep Blue's games against Gary Kasparov ran 37, 73, 39, 50, 47, and 43 moves for the first match, and 45, 45, 48, 56, 49, 19 for the second. Since these games mostly ended by mutual agreement rather than holding on until the very end, a computer would have needed to examine beyond the end of those games.

Yes, with the aid of specialized chess computing hardware and a supercomputer (by 1997's standards), Deep Blue managed to beat Kasparov in the chess match. That's why they turned to Go, which has somewhere around at least 10 ^ 10 ^ 100 possible games, and the only thing that has beat it has been neural networks.

I, as well as everyone else who has dabbled in this, realized that the number of positions to look at quickly swamps the computer, and even a basically competent player of any such game is going to need a lot of tuning to know what moves to look at and what moves to not bother looking at.

If you are attributing the win to an artificial intelligence, you are saying you have not created the necessary statistical etc. programming, and the victory was mostly just random or "luck".

Why yes, I'm sure AlphaGo beat the world's best go player by luck. You could read the books I pointed you to, and work the problems, but it's easier for you to pontificate on it, isn't it.

367faktorovich
mayo 18, 2022, 9:03 pm

>358 lorax: I already said the idea is absurd. Feel free to respond with any knowledge you are holding.

368faktorovich
mayo 18, 2022, 9:10 pm

>359 Keeline: "The mass of the sun is... about 333,000 times the mass of the Earth": https://www.space.com/17001-how-big-is-the-sun-size-of-the-sun.html#:~:text=Mass....

The largest planet in the solar system is "Jupiter's mass is more than 300 times that of Earth": https://skyandtelescope.org/astronomy-resources/smallest-and-largest-planets-in-....

So, even if each sun has 20 Jupiter-sized planets, they would add up to less than 2% of the mass of the sun. So even if we are missing all of the mass of all of the planets, but have detected all of the stars in the universe and accounted for their precise mass; the mass of the planets would be a tiny fraction of all the mass in the universe.

369faktorovich
mayo 18, 2022, 9:33 pm

>360 lorax: There is a simple quantitative perspective to the problem you are describing. "The Large Hadron Collider took a decade to build and cost around $4.75 billion." There is a new collider that is fund-raising: "This one could end up being 100km, almost four times the size, and may cost up to $23 billion to produce." https://www.cnet.com/science/cern-wants-to-build-a-new-23-billion-super-collider.... These numbers indicate that if these colliders' function can be made to sound inconceivably convoluted; then, investors, auditors and the public would be likely to continue investing in building them, even after the one already built cost an astronomical figure. If a $5 billion collider cannot achieve the task set for it, the solution is not to spend X5 more to make a bigger collider, but instead to stop doing science-fiction instead of pure science. There is a direct balance between this planet's spending on science that has no practical use to humanity, and its failure to actually invent things that would help us escape the planet's environmental destruction; either one or the other is going to be invested in and the fiction-writers have been winning for centuries. A new dimension is not going to open if humanity wastes billions on nonsense; the money is going to go into the pockets of the fiction-writers, and the four dimensions will still remain as the only dimensions reality has concocted.

Comparing more than 3 axis is impossible in a single visualization where there can only be a 3D X, Y, Z set of axis. To compare more than these items obviously can be done, but not in a visual diagram where all of them are included in a single diagram. If the number of types of data being compared is over 4; this does not equate with new dimensions beyond the 4 dimensions of space-time. Most systems give data in 2D or compare type of data X vs Y. This is why I designed several steps where I calculate the data in 2D in each step, but the combined data ends up comparing texts in more than 4 axis, or: all texts vs all texts vs 27 different tests (each of which is a separate axis before the data is combined). Previous computational-linguists have not provided information on what precisely happens to their raw data to arrive at an attribution result; this data being complex and multi-axis is not an excuse because it is every researcher's responsibility to present their data and conclusions in a format that allows auditors to check their work.

370faktorovich
mayo 18, 2022, 9:38 pm

>362 Keeline: As I have explained the 27 tests give precise data that I provide to the public on GitHub; thus, researchers can also manipulate or test this raw data in the spreadsheet. Turning this data into binaries is an essential step to create a 2D comparative table, or otherwise it would be a 284X284X27+ table that would be incomprehensible for readers who try to figure out what it is saying. Using 27 quantitative tests is the opposite of science fiction; it is basic math.

371faktorovich
mayo 18, 2022, 10:15 pm

>363 Petroglyph: Cambridge defines data as "information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer". The texts you are analyzing is not your "data", but rather your "Independent variable" or the things that are changed or examined. Calling texts "data" is like calling the human subjects (or photographs of them) in a drug-experiment the "data" before they are tested in any way to determine their baseline, or given the drug in question. I already stated that you should provide your "analysis", if you consider it to be everything you actually do in your experiment, or all of the quantitative figures you derive in the step(s) you take.

"it's about comparing a text to an authorial signature generated from a body of texts." In other words, your test can only reach an attribution conclusion if you are sure about which texts Author A wrote and have created a cluster of these texts to establish the signature. However, the problem I have noticed in almost all previous attribution studies of the Renaissance is that they assume all (or almost all) of "Shakespeare" (or another byline) is known to be by Author A. This would work if there was indeed a single author called "Shakespeare" who wrote all of these texts, but by comparing all 284X284 texts against each other, I learned that the currently "Shakespeare"-bylined texts fall into 5 different authorial groups with different signatures. Thus, comparing a mystery text X against all texts currently attributed to "Shakespeare" creates absolutely irrational outputs that cannot possible establish a logical attribution; the likely outcome will be that the ghostwriter who wrote most of the "Shakespeare" texts (Percy or Jonson) would come out as the most dominant signature, or the data will be a nonsensical combination with a signature somewhere between all 5 ghostwriters. This expected data-irrationality is the likely reason computational-linguists in this field never compare a mystery text against all texts by "Shakespeare", and never reveal the raw data with the stages of the calculation to allow for auditing of these results. It is easier to come up with a final graph that suggests the current bylines are correct than to do what I did or retest all 284 major texts from this period as if there are no firmly known bylines, and all have to be verified or contradicted.

The flaw that I previously pointed to of it being impossible to compare 100 or 1000 most-frequent words in 284 different texts and to come up with a rational set of data that gives an attribution answer because you are failing to simplify the data or to instruct the method on how to determine which words or what frequencies of appearance are relevant and what these indicate about an attribution decision. Are you comparing Text k against all 37 texts in the "Shakespeare" corpus, or have you come up with some sort of an average between these. It would be impossible to set an "average" when each one of the 37 texts has a different combination of words in it. Or are you comparing only Text k or one of "Shakespeare's" "known" texts against the mystery Text a? Then, you are basing your conclusion on a match only between these two texts, and failing to check if there is a similar match between many otherwise bylined texts in the corpus. And the formula itself does not provide space to fit more than a single text, as the letters mean items that fit the dimensions of a single text. There are many other logical flaws in this method, so if you just refuse to share the precise stages of your analysis; the only explanation is that you cannot share these without being found to have give fraudulent conclusions.

And now you are saying that you are turning your data into binaries as well, and in the same response when you are still criticizing my use of binaries? No, you have not stated that you use binaries.

The texts I use are cited in my bibliography with their sources, so other users can upload them from the web to re-test them. It would take up too much data and it would be unethical for me to take an EEBO/ Gutenberg book, perform a basic pre-test edit, and then re-publish it on my own website. Even if these sites release public domain rights, it does not feel like the right thing to do to republish their work on my own website (as you have done when posting these texts in your link).

Yes, I have posted "the actual percentages from AMW" on GitHub. You just haven't found this data, or haven't looked for it. This includes the precise passive voice, parts-of-speech etc. percentages etc. If you haven't found this until this point, it is amazing how many nonsensical insults you can shoot at my method without ever actually checking what my research findings are.

372faktorovich
mayo 18, 2022, 10:24 pm

>364 Petroglyph: I have indeed already explained how all of the "data" (or mostly concluding visuals and unexplained concluding figures without the steps that led to them) that you have provided so far has indeed been manipulated, misleading, and unsubstantiated. You have not been responding to any of my specific claims of manipulation or falsity, as instead you have chosen to ignore them by pretending they do not exist. And now you are taking some of my statements out of context, and still refusing to actually address the criticisms they have raised.

The problem is not that you have failed to define terms such as "relative frequencies", but that you have failed to explain how these can possibly be applied to checking the relative frequency of 1000 or all the different words in a corpus of 284 different texts.

Yes, if a researcher refuses to specify the exact steps involved in taking raw data to a conclusion; they are by-definition hiding their method. And an absence of a precisely defined method means that a researcher can manipulate data within their black box without an auditor being able to prove beyond all data if or if not manipulation has taken place (because the precise steps taken cannot be duplicated, since they are unknown).

373faktorovich
mayo 18, 2022, 10:31 pm

>366 prosfilaes: It is probable that the computer was able to calculate the best approaches to win just by learning the few thousand standard games that are taught in chess textbooks, or by steering games in directions textbooks advise when facing an approximate set of circumstances. Since we are imagining hypotheticals, alternatively (assuming that no existing computer actually could run the calculations required in the time available), just like fixing a football game, the computer programmer could have colluded with the human chess player for the human to take a fall after following a pre-determined set of moves. This would have given the computer programmer more work in AI development, and the human could have gotten a kick-back.

374paradoxosalpha
mayo 18, 2022, 11:27 pm

Nothing is easier for Dr. Faktorovich to imagine than mendacity. Nothing at all.

375prosfilaes
mayo 18, 2022, 11:38 pm

>373 faktorovich: Everyone with a little education knows that time starts at 00:00:00 UTC on January 1st, 1970, as Ken Thompson and Dennis Ritchie decreed. Even if you are part of the heresy that believes that time_t's run backwards, that still only goes back to 1901. Therefore, all this stuff about a British Renaissance before then is just absurd, a way for people to pretend to be scholars and take money from mathematicians, computer scientists and other real scholars. Notice that it wasn't enough to make up a bunch of stuff written 300 years before the start of time, they had to pretend that it was a Renaissance, a rebirth of a time period hundreds of years before that. Just look at Tolkien; he made up Beowulf and the Green Knight, but apparently the Hobbit was a bit too far, so they ganged up on him and declared that part fictional, as if it was any difference than the other stuff he created.

376Petroglyph
mayo 18, 2022, 11:41 pm

377Keeline
mayo 19, 2022, 12:53 am

>376 Petroglyph:

My understanding was that Galileo got in trouble for adopting and defending the model of Copernicus. He added to it with his telescopic observations of the moons of Jupiter, etc. Of course being in Italy within reach of the Pope and Catholic Church did not help things since they were sold on the ideas of Aristotle.

James

378Petroglyph
mayo 19, 2022, 12:46 pm

>377 Keeline:
That, and the whole "I'll write a book and present heliocentrism as a hypothesis (as requested), but I'll put the geocentrist arguments in the mouth of Simpleton and poke fun at that guy and compare him to pope Urban."

379faktorovich
mayo 19, 2022, 1:26 pm

Here is how Richard Verstegan describes the origin of the "Anglo-Saxon" people in Britain in "Restitution" (the "Anglo-Saxon" term is credited by many as originating from this book): "Hengist and his brother with their forces, having arrived in the Isle of Thanet in Kent, in the year of our Lord before specified 447 AD, and in the second year of the reign of King Vortigern, as before has also been said, were by their said King greatly welcomed. And upon marching against his northern enemies, the Scots and Picts, they valiantly encountered them in battle, and overthrew them, whereby they gained for themselves most great honor and reputation. Hereupon, Hengist desired from King Vortigern so much ground as with the skin of a bull he could compass around; which having obtained, he cut out a large bull’s hide into very small thongs, leaving them still fastened one to the other, and having by this means brought the whole skin as it were into one thong of a great length, he laid it in a compass on the ground, and so accordingly laid the foundation of a castle that he finished and called the Thong Castle, situated near to Sittingbourne in Kent. And in this Castle, he afterward feasted King Vortigern, as soon I will declare. With this Castle thus built, Hengist sent home word to his country of Saxony as well of the good success he had against King Vortigern’s enemies, as of the goodness and fertility of the soil, and the lack of warlike courage in the Britons. Whereupon, a larger navy and a number of footmen were sent over out of Saxony to him, and these consisted of the three principal sorts of Saxon people, namely: of those who without distinction bore the name of Saxons, of those who were particularly called the English, and of those who were called the Jutes." In an earlier passage, Verstegan comments that the different versions of this legend are blatantly fictitious and contradictory and are not to be believed (and yet scholars have taken this legend to be factual and still call Britons "Anglo-Saxons"): "And here I cannot but wonder at Occo Scharlensis, of whose little credit I have spoken in the Second Chapter, who tells us first of two brethren called Hengist and Horsa, the sons, as he says, of Udulphe Haron, Duke of Friesland, and that Hengist, who was two years older than his brother Horsa, was born in the year of our Lord 361; by this account, Hengist, when in the year of our Lord 447 he came to the aid of King Vortigern, must have been 86 years old, and so a very unlikely man to bear arms. He further tells us of another two brethren, also called Hengist and Horsa, sons as he says, to Odilbald King of Friesland, who were born in the year 441, and who, he says, also went into Britain, to revenge the deaths of the aforesaid Hengist and Horsa; and they were both slain; but their soldiers, notwithstanding—upon being animated by Gorimond, an Irish Captain (who with many Irishmen had joined them), to revenge the deaths of their princes—fought so valiantly that they obtained the victory, and made Gorimond their King, after whose death they chose one of their own nation to that dignity. Look, here we see that Occo has not only found out about other parents for our Hengist and Horsa than those assigned by the venerable Bede (though he lived 200 years before Occo), and also these were parents who were unheard of by any other previous author, and Occo also finds out about a second Hengist, and a second Horsa created to revenge the deaths of the former; and he deserves in both of his relations to be believed alike."

This is an example of how legend and historical fiction have been turned into unshakable history. The fact that it used to be fiction was directly acknowledged by those who wrote such historical fictions in the Renaissance in their own writings. Only because modern scholars rarely go back to read ancient or Renaissance texts to trace the origin of "history" do they fail to recognize that the "history" they are repeating is fictitious.

380Keeline
mayo 19, 2022, 2:41 pm

>379 faktorovich:

This wall of text would be easier to read if you used the <blockquote> and </blockquote> tags around your quoted text. It is hard to see where the quote leaves off and your commentary picks up again. If you adopted this, it would look more like this:
_____

Here is how Richard Verstegan describes the origin of the "Anglo-Saxon" people in Britain in "Restitution" (the "Anglo-Saxon" term is credited by many as originating from this book):

Hengist and his brother with their forces, having arrived in the Isle of Thanet in Kent, in the year of our Lord before specified 447 AD, and in the second year of the reign of King Vortigern, as before has also been said, were by their said King greatly welcomed. And upon marching against his northern enemies, the Scots and Picts, they valiantly encountered them in battle, and overthrew them, whereby they gained for themselves most great honor and reputation. Hereupon, Hengist desired from King Vortigern so much ground as with the skin of a bull he could compass around; which having obtained, he cut out a large bull’s hide into very small thongs, leaving them still fastened one to the other, and having by this means brought the whole skin as it were into one thong of a great length, he laid it in a compass on the ground, and so accordingly laid the foundation of a castle that he finished and called the Thong Castle, situated near to Sittingbourne in Kent. And in this Castle, he afterward feasted King Vortigern, as soon I will declare. With this Castle thus built, Hengist sent home word to his country of Saxony as well of the good success he had against King Vortigern’s enemies, as of the goodness and fertility of the soil, and the lack of warlike courage in the Britons. Whereupon, a larger navy and a number of footmen were sent over out of Saxony to him, and these consisted of the three principal sorts of Saxon people, namely: of those who without distinction bore the name of Saxons, of those who were particularly called the English, and of those who were called the Jutes.

In an earlier passage, Verstegan comments that the different versions of this legend are blatantly fictitious and contradictory and are not to be believed (and yet scholars have taken this legend to be factual and still call Britons "Anglo-Saxons"):

And here I cannot but wonder at Occo Scharlensis, of whose little credit I have spoken in the Second Chapter, who tells us first of two brethren called Hengist and Horsa, the sons, as he says, of Udulphe Haron, Duke of Friesland, and that Hengist, who was two years older than his brother Horsa, was born in the year of our Lord 361; by this account, Hengist, when in the year of our Lord 447 he came to the aid of King Vortigern, must have been 86 years old, and so a very unlikely man to bear arms. He further tells us of another two brethren, also called Hengist and Horsa, sons as he says, to Odilbald King of Friesland, who were born in the year 441, and who, he says, also went into Britain, to revenge the deaths of the aforesaid Hengist and Horsa; and they were both slain; but their soldiers, notwithstanding—upon being animated by Gorimond, an Irish Captain (who with many Irishmen had joined them), to revenge the deaths of their princes—fought so valiantly that they obtained the victory, and made Gorimond their King, after whose death they chose one of their own nation to that dignity. Look, here we see that Occo has not only found out about other parents for our Hengist and Horsa than those assigned by the venerable Bede (though he lived 200 years before Occo), and also these were parents who were unheard of by any other previous author, and Occo also finds out about a second Hengist, and a second Horsa created to revenge the deaths of the former; and he deserves in both of his relations to be believed alike.

_____

Handling long quotes in any academic writing has certain rules and we do have features of HTML that work in LT Talk that facilitate this. The goal is to add clarity. Yes, this is an Internet forum but add some visual cues to help with the communication.

At the moment this comment seems to be a non sequitur since it is not clear what that has been discussed before precipitates this post. It is interesting. I am interested in the concept of "how do we know what we know" and realize that modern methods of documenting history are different than the centuries that preceded it.

James

381paradoxosalpha
mayo 19, 2022, 4:14 pm

I agree that the Anglo-Saxon etiological narrative analysis by Verstegan is interesting in its own right, though not obviously integrated into the prior discussion. It puts me in mind of the excellent study Giants in Those Days: Folklore, Ancient History, and Nationalism, which focuses on the use of gigantology in historical discourses about antiquity in early modern Europe. Stephens proposes reading Rabelais as a criticism of such histories. (I have a longer review on the LT book page.)

382raidergirl3
mayo 19, 2022, 7:47 pm

>378 Petroglyph: Galileo was like Colbert from The Colbert Report.

383faktorovich
mayo 19, 2022, 9:13 pm

>380 Keeline: Given the budget LT has, they really would create a feature they allows users to automatically create block quotes or indentation as well as other features that can be used in a typical word-processing software. It should not be up to users to program each of their posts with HTML. So direct your criticism of how posts on LT look to LT and not to its users. I added quotes in the correct places and inserted comments only before or after a quote in a paragraph, and not in the middle of it.

Yes, as you have gathered my comment is explaining the answer to the complex question of how we know things that have been categorized as established "history" (such as that Elizabeth I spoke many languages or James I spoke English or not, as well as the question of if the term "Anglo-Saxon" and the idea that Britons come from Germany's Duchy of Saxony), and if modern historians who are "documenting" their history books are citing sources like Verstegan or later writers who plagiarized from Verstegan without giving him credit, without checking if what they are repeating has any basis in earlier sources or if it is a fiction that was invented at the start of the re-quoting path.

384faktorovich
mayo 19, 2022, 9:22 pm

>381 paradoxosalpha: The BRRAM series explains many parts of British history and culture in an entirely new way by translating sources for the first time that have remained mostly unread until now and researching other ancient and Early Modern sources to explain these ideas further. These are just a couple of paragraphs out of one of these 17+ books that all of you have been dismissing as unnecessary to read to reach a conclusion about my findings. I also cover the reasons why these ghostwriters are the only possible authors of these 284 works across the series. My computational method is only a tiny fraction of these larger findings. And I am not studying Verstegan's origin legends to argue about any truth in these accounts, but rather to point out that Verstegan did not think there was any truth in these legends in this account, but later critics have only quoted the sections that make it seem like historical facts.

385lilithcat
mayo 19, 2022, 10:02 pm

>383 faktorovich:

Given the budget LT has

How do you know what LT’s budget is?

386amanda4242
mayo 19, 2022, 10:26 pm

>383 faktorovich: Given the budget LT has, they really would create a feature they allows users to automatically create block quotes or indentation as well as other features that can be used in a typical word-processing software.

*snort* You really don't do your research, do you? LT is not a large company and they have all of four developers to work on the entire site (see https://www.librarything.com/whoweare.php). Improvements to Talk are in the cards for the ongoing site update, but, again, they are a small company and are unlikely to make a priority of something that members can do with a simple tag.

The polite response to >380 Keeline: would have been something along the lines of "Thank you for explaining how to do blockquotes. That will certainly be helpful when I'm quoting long sections." I guess your manners are on par with your research skills.

387prosfilaes
mayo 19, 2022, 10:30 pm

>383 faktorovich: Like the idea that Vergestan died in 1640? Again, if he didn't die in Antwerp in 1640, I don't why you can assume he died in 1640.

388Petroglyph
mayo 19, 2022, 10:33 pm

>371 faktorovich:

You continue to confuse "data" with "analyses".

The texts you are analyzing is not your "data", but rather your "Independent variable" or the things that are changed or examined

Lol. More confusion.

In other words, your test can only reach an attribution conclusion if you are sure about which texts Author A wrote and have created a cluster of these texts to establish the signature.

Reasonably certain is enough. And if there's any doubt, you can test texts to see if they are similar.

I have noticed in almost all previous attribution studies of the Renaissance is that they assume all (or almost all) of "Shakespeare" (or another byline) is known to be by Author A

There's plenty of fretting in Shakespeare authorship studies about which texts can be comfortably assigned to him, and which cannot be. Plenty of (e.g. testing the body of works purportedly by Shakespeare, except one randomly selected work, which is then compared to the n-1 body. Do this for all works. Plenty more tests have been done.) Your assessment of this field is, well, contentious and uninformed. But then again, you posit a "rival" methodology that you try and make money off of, so I suppose your bias in wanting to promote your stuff over established studies is unavoidable.

I learned that the currently "Shakespeare"-bylined texts fall into 5 different authorial groups with different signatures.

Sure. And the tragedies were different from the comedies, right? Must be different authors. (/sarcasm)

The flaw that I previously pointed to of it being impossible to compare 100 or 1000 most-frequent words in 284 different texts and to come up with a rational set of data {...} It would be impossible to set an "average" when each one of the 37 texts has a different combination of words in it.

It's really not. Compare relative frequency of, say, the across all texts. Then compare of, and then a, and then in, and so on, down the list. Nothing impossible or even hard to understand about it. A kindergartner could grasp the procedure.

In this fashion, it is very much possible to compare relative frequencies of thousands of words across hundreds of texts. The possibilities in reality are wider and more diverse than your imagination (or your anti-not-your-method bias) will allow.

All these calculations would take ages for a human. But a computer does many thousands of these in a second. That is why we use computers instead of doing everything by hand. You should try it some time.

Are you comparing Text k against all 37 texts in the "Shakespeare" corpus, or have you come up with some sort of an average between these? {...} Or are you comparing only Text k or one of "Shakespeare's" "known" texts against the mystery Text a?

Look. The "problems" you've just thought of while trying to understand the sum of average differences between two lists of Z-scores aren't problems at all. If you have to ask these questions, then you haven't understood the formula, and you are not in position to offer criticism that other people have to take seriously.

There are many other logical flaws in this method, so if you just refuse to share the precise stages of your analysis; the only explanation is that you cannot share these without being found to have give fraudulent conclusions

Again, you are jumping to the conclusion "this is suspicious". That is your inference.

There is another explanation: I have not shared every single calculation because they are too mundane to be shared. Take the relative frequency of all words. Turn them into Z-scores. Subtract. Divide by n. Like, none of these are complicated steps; the formula clearly lays out the steps. What else is there to say? Are you really going to audit tens of thousands of subtractions? Tens of thousands of dividing by the standard deviation?

You have not shared which operations your computer does when it copy/pastes. Is that suspicious? Or too mundane to explain? Perhaps more to the point, you, Faktorovich, have not shared the calculations that your free online tools do to establish average sentence length or % of nouns. I don't mean the outcome of those calculations -- but the actual take-number-of-words-and-divide-by-number-of-sentences steps. You have shared precisely none of those.

The fact that you have not done so is not suspicious at all, because calculating the average or a percent is too basic to need elaboration. It's the same thing for Z-scores and standard deviations and all the other operations in that formula. People who do statistical stylometry know what Z-scores are and don't have to have this explained to them. Over and over again. For each graph.

And now you are saying that you are turning your data into binaries as well, and in the same response when you are still criticizing my use of binaries? No, you have not stated that you use binaries.

More misunderstandings. They are not binaries -- they are probabilistic estimates of closeness. They are not 0/1 either/or. They are not 0% vs 100% nothing inbetween like in your method. A text compared to another text using Burrows' Delta will give you a probabilistic measure of how close two texts are. Or how close a text is to the body of an author's work.

Incidentally, those estimates are never 0% or 100%. 100% similarity means that two texts are completely identical. 0% means that they share absolutely nothing, not even a single word. If you get a 0% or a 100% similarity rating, you've made a mistake somewhere.

it does not feel like the right thing to do to republish their work on my own website (as you have done when posting these texts in your link)

One: I have not published these texts. I have made them available.
Two: PG's licence allows me to do exactly what I have done. From PG's "small print":

You may distribute copies of this etext electronically, or by disk, book or any other medium if you either delete this "Small Print!" and all other references to Project Gutenberg, or only give exact copies of it. {...} Among other things, this requires that you do not remove, alter or modify the etext or this "small print!" statement. You may however, {...} provide, or agree to also provide on request at no additional cost, fee or expense, a copy of the etext in its original plain ASCII form

Would you like me to provide full-PG texts for any PG-based corpus file I have shared?

Yes, I have posted "the actual percentages from AMW" on GitHub. You just haven't found this data

I must have missed that, yes. Can you link me to the table that contains those results? Once you do that, I'll emend my claim in post 363.

389Petroglyph
mayo 19, 2022, 10:36 pm

>372 faktorovich:

I have indeed already explained how all of the "data" {...} has indeed been manipulated, misleading, and unsubstantiated

You have made empty accusations based on misunderstandings and the misguided notion that scientific enterprise ought to be able to be understood by a child. Peaked in primary school? Saying something does not mean explaining it. Agreeing with your own claims does not mean that they are convincing to other people. (Else everyone would be christian, communist, zoroastrian, existentialist or keynesian.)

The problem is not that you have failed to define terms such as "relative frequencies", but that you have failed to explain how these can possibly be applied to checking the relative frequency of 1000 or all the different words in a corpus of 284 different texts

You continue misunderstanding how a list of relative frequencies for all the words in one text can be compared to a list of relative frequencies for those same words in another text. It's really not that hard.

if a researcher refuses to specify the exact steps involved in taking raw data to a conclusion; they are by-definition hiding their method

... but the jump to "this is suspicious" is entirely yours. In explaining your method, you did not explain how to open up a spreadsheet app, or how to copy/paste, or how to access a web site. You did not do so because those steps are, indeed, so basic as to not merit explaining in detail. You can assume your audience is aware of how to perform those steps.

Similarly, to people engaged in stylometrics, that is, the use of statistical methods to analyze textual features in order to derive information about the author, techniques like Z-scores and Burrows' Delta and distance measures and normalization and so on can be assumed to be received knowledge. Your insistence that these steps have to be elucidated explicitly are really, really bizarre.

For example: you have not shared the actual calculations that your free online services do to count the number of nouns in a text, or the average sentence length, or the commas-per-hundred-sentences. I mean: the actual "take the total number of comma's, divide by number of sentences and multiply by 100" calculations. It would be really really bizarre if anyone were to accuse you of fudging the data for not sharing the actual calculations that yield these figures.

Normally, people without the relevant statistical background (and sufficient training in the methods) don't commit to critiquing methods they do not even understand. But your self-assessment of your own understanding is, as usual, far off the mark.

That, or you're just looking for any excuse to say something damning, something that sounds like a fundamental critique. After all, you need to sell your books, and the laziest way of talking up your own stuff is to disparage whatever not-your-stuff is.

Or perhaps both. Why not both? Yeah. Let's go with that.

390amanda4242
mayo 19, 2022, 10:43 pm

>388 Petroglyph: And the tragedies were different from the comedies, right? Must be different authors.

Well, that might explain why the jokes in the tragedies are funnier than those in the comedies. ;)

391Keeline
mayo 19, 2022, 11:11 pm

>383 faktorovich:

Given the budget LT has

Budget? They don't charge for cataloging memberships. They have some partnerships with libraries and Amazon but this is not a billion dollar company.

I thought there might be a (Chrome) browser extension that might allow one to add tags around some selected text. However, I don't see one in my searches. I have a question out to one of my colleagues to see if it exists or could be made into a browser extension or TamperMonkey script. I feel sure it can be but it is a matter of knowing how to do it. I found some code for the key functionality.

While I am comfortable inserting the tags I want, I recognize that is not in everyone's comfort zone.

The alternative would be to use the return/enter key more often, especially with long quotes.

James

392Petroglyph
mayo 19, 2022, 11:24 pm

>390 amanda4242:

The old adage: it isn't funny unless someone gets hurt. There's a lot of truth in that.

So many good lines. Now I want to see/read a play again, but I don't have the time :(

393Keeline
mayo 19, 2022, 11:27 pm

>390 amanda4242:

The Comedies in the Shakespeare corpus are generally not intended to be "funny" in the way that we define "comedy" to be today. In general the distinction between a "comedy" and a "tragedy" (as identified in the titles of the quartos and especially the First through Fourth Folio editions) is that a comedy ends in marriage or the promise of marriage. There can be just as much death, treachery, and misunderstandings. Compare Othello with Much Ado About Nothing and you will find many similar themes and elements. The first is classified as a tragedy and the latter is a comedy.

This won't come as a surprise to anyone familiar with the works attributed to Shakespeare.

James

394amanda4242
mayo 19, 2022, 11:54 pm

>392 Petroglyph: Try The Complete Works of William Shakespeare (Abridged). All the plays in just 90 minutes!

395amanda4242
mayo 19, 2022, 11:58 pm

>393 Keeline: I know. I was just pokin' fun. :)

396Keeline
mayo 20, 2022, 12:06 am

>394 amanda4242:

We've seen the RSC a few times years ago. Having all of the history plays as a football game fighting over the crown. Throw out the penalty flag — fictional character on the field. :)

The comedies are treated like the commercials for the musical "Seven Brides for Seven Brothers."

Great fun.

James

397Petroglyph
Editado: mayo 20, 2022, 1:12 am

>394 amanda4242:

I did not know that existed. The Romeo&Juliet part was fun. I'll save this for later today! Thanks!

398Petroglyph
mayo 20, 2022, 10:15 am

>382 raidergirl3:

He was a cheeky monkey sometimes, yeah, trolling the authorities.

399Petroglyph
mayo 20, 2022, 10:34 am

>383 faktorovich:

later writers who plagiarized from Verstegan without giving him credit

Post hoc, ergo propter hoc. Rookie mistake.

>379 faktorovich:

The story of Hengist and Horsa goes back to at least Bede (7th Century). I remember reading about them when I translated portions of the Anglo-Saxon Chronicle (9th-12thC).

(For clarity: the Anglo-Saxon Chronicle is a series of annals that note important events for each year. The chronicle was started (probably) under Alfred in the late ninth century, but the "history" it contains goes back to about 60BC, when Caesar made his foray into Britain. These annals then chronicle the history of Britain and Rome and Christianity as the literate class of Anglo-Saxon societies saw/knew it until contemporary times. New events were chronicled until well into the 1100s. )

and yet scholars have taken this legend to be factual

From the wikipedia page for Hengist and Horsa:

Most modern scholarly consensus now regards Hengist and Horsa to be mythical figures, and much scholarship has emphasised the likelihood of this based on their alliterative animal names, the seemingly constructed nature of their genealogy, and the unknowable quality of the earliest sources of information for their reports in the works of Bede. Their later detailed representation in texts such as the Anglo-Saxon Chronicle can tell us more about ninth-century attitudes to the past than anything about the time in which they are said to have existed.

Your assessment of what historians think is way off the mark.

Historians are not just interested in what actually, factually happened. They are also interested in past societies qua societies: what did they think, how did they reason, what were their ideas about their own history?

Naturally, historians don't trust everything Livy or Sallust or Suetonius or Plutarch say. But those works are a) a direct statement of actual opinions that some of the actual people of the time had about their own history, and b) some of the few texts that actually have come down to us from the relevant period -- we're forced to work with texts like these. If you're interested in how ninth-C people in Britain saw themselves, the Hengist and Horsa story is of value, as is their account of the pre-Alfred kings and the succession of Roman emperors because it featured in their mental model of the world and the place of their people in it. Or at least in some people's mental models. And it's what they based their policies and actions on.

still call Britons "Anglo-Saxons"

Also, "Anglo-Saxons" is a long-standing term used in academia and elsewhere, because it's the exonym that's been used for pre-Norman societies (and sometimes their descendants) in Britain for many many centuries. Using it does in no way, shape or form imply that you accept any historical myth. Think of it like the word "sunrise": it's factually inaccurate, but also the normal term, and it's not likely to be changed any time soon. And, of course, using words "sunrise" and "sunup" and "sundown" does not imply embracing geocentrism. They're just words that stuck around. Same thing with "Holland" and "the Netherlands".

"Britons" is often used for Celtic-speaking pre-Roman and pre-Germanic populations of Iron Age Britain.

Occo has not only found out about other parents for our Hengist and Horsa than those assigned by the venerable Bede (though he lived 200 years before Occo)

Conflicting stories about founding figures does not necessarily mean there is no historical truth at all. But a careless reader and thinker might equate those two.

I'll also note that Verstegan treats Bede as authoritative merely because his writings are older. Rookie mistake.

(Sidenote: why do you accept Verstegan's assessment of the situation so uncritically? Is it because he's casually dismissive of his sources and that makes him sound like he knows what he's talking about?)

Only because modern scholars rarely go back to read ancient or Renaissance texts to trace the origin of "history" do they fail to recognize that the "history" they are repeating is fictitious

What the hell are you talking about? Historians are all about going back to whatever original sources still exist! Historians are forced to use Bede and the Anglo-Saxon chronicle because those are some of the precious few writings that have come down to us from actual Anglo-Saxon societies. The layers of myth-making and propaganda inherent in those texts are unfortunate, and they require careful handling, but they are the best historiography we've got. They're basically the only historiography we've got from pre-Norman Britain.

Have you read Bede? His calendrical calculations and Easter Tables are impressive!

Let's take a Shakespearian example.

The life of the Roman general Coriolanus, who supposedly led Rome's enemies against her after being snubbed, is probably a myth. but it's a long-standing story that was accepted as true -- probably in part because it's a good story with lots of drama.

Today we think Shakespeare's Coriolanus is based on a myth. But at the time Billy from Stratty wrote that play, the story was regarded as basically true -- as true as the events involving Julius Caesar, Mark Antony and Cleopatra. Some good stuff from Antiquity to make a play out of. And enjoying the play does not mean that you accept a myth as real verifiable history.

Whether Hengist, Horsa or Coriolanus were real people, or whether and how much of the material about them has a real historical truth to it are valid concerns sometimes, depending on the questions you want to ask and the time period you're interested in. But if you write about people who thought that the stories about Hengist or Coriolanus had much historical veracity to them, or about four hundred year old attitudes to those stories (or from 800 years ago, or 1200 years), the question of historical veracity becomes much less relevant: the people you write about thought these stories were true (or not), and then their attitude is the material you write about, not what you think really happened.

Dismissing foundational myths is fine if you're only interested in what actually happened (and to what extent youc can assess that). But if you're interested in the people who believed these myths, or in writings that treat these myths as real events, you're forced to incorporate them into your mental and historical horizon. Call it suspension of disbelief, if you will.

(See also: is the holy ghost real? Were the gospels written by eyewitnesses or people who spoke to eyewitnesses? Probably not, in both cases. But for large sections of the history of our societies, people did genuinely believe both of those things to be true. And they're some of the foundational concepts, so you can't just ignore them. Suspension of disbelief.)

But for someone like yourself, who is not interested in other people's thoughts and has a hard time adopting someone else's perspective for a second, your disdain for historians and the complexities of juggling multiple different mental horizons is unsurprising.

Once again, you imagine a very basic objection that is not even true, and your very next step is to immediately discard the whole enterprise. It's a bit of a theme with you.

400faktorovich
mayo 20, 2022, 1:13 pm

>385 lilithcat: I assumed there was a substantial budget. Are you saying there isn't one? I found this page on a quick search: https://www.owler.com/company/librarything

401faktorovich
mayo 20, 2022, 1:17 pm

>386 amanda4242: Just as Keeline perceived he was helpful by criticizing my insufficient use of html. I perceived I was helpful by suggesting that LT should provide an easier formatting option, especially if users are having trouble separating block quotes from surrounding comments as Keeline pointed out. It continues to amaze me how you guys can find fault in the same types of criticism when it is coming from me that you find flawless when anybody else is using it to say something negative about me.

402faktorovich
mayo 20, 2022, 1:20 pm

>387 prosfilaes: Where did I object to the idea that Verstegan died in 1640? I do have doubts that he was in Antwerp when he died, but this is not a point that I have raised anywhere in BRRAM or in this thread, as it is merely a doubt I have when asked if I have this doubt, rather than something significant I have contemplated before. Please clarify what you are trying to say?

403amanda4242
mayo 20, 2022, 1:29 pm

>401 faktorovich: It amazes me that you think Keeline's statements and yours are the same. Keeline was explaining how you could use a simple tag to make it easier for others to read your posts; you criticized LT for not doing it for you.

404Petroglyph
mayo 20, 2022, 1:43 pm

As a palate cleanser for all the tedious back-and-forth here, I thought I could talk about another book that some of the misunderstandings in this thread remind me of.

That book is The Young Visiters or, Mr. Salteena's plan, written in 1890 by then nine-year-old Daisy Ashford. It was published much later, in 1919, preserving the original spelling. Here it is on ProjGut.

The book is about Mr Alfred Salteena, "an elderly man of 42", who is not a gentleman, but wants to learn how to be one so he can get a job at Buckingham Palace. The other main character is Ethel, one of Mr. Salteena's protégées, who is a young girl of "a rarther lazy nature" but also "rarther curious by nature"; she's a social climber, too.

I'll quote a few representative sections.

The book opens when the pair are invited by Salteena's friend Bernard:

My dear Alfred.

I want you to come for a stop with me so I have sent you a top hat wraped up in tishu paper inside the box. Will you wear it staying with me because it is very uncommon. Please bring one of your young ladies whichever is the prettiest in the face.
I remain Yours truely
Bernard Clark.

My dear Bernard
Certinly I shall come and stay with you next Monday I will bring Ethel Monticue commonly called Miss M. She is very active and pretty. I do hope I shall enjoy myself with you. I am fond of digging in the garden and I am parshial to ladies if they are nice I suppose it is my nature. I am not quite a gentleman but you would hardly notice it but cant be helped anyhow. We will come by the 3-15.
Your old and valud friend
Alfred Salteena.

The pair have dinner at Bernard's place, and he also treats them to a tour of his portrait gallery:

Well said Mr Salteena lapping up his turtle soup you have a very sumpshous house Bernard.

His friend gave a weary smile and swollowed a few drops of sherry wine. It is fairly decent he replied {...} after our repast I will show you over the premisis.

Many thanks said Mr Salteena getting rarther flustered with his forks. {...} Mr Salteena was growing a little peevish but he cheered up when the Port wine came on the table and the butler put round some costly finger bowls. He did not have any in his own house and he followed Bernard Clarks advice as to what to do with them.

{...}

Then Bernard said shall I show you over my domain and they strolled into the gloomy hall.

I see you have a lot of ancesters said Mr Salteena in a jelous tone, who are they.

Well said Bernard they are all quite correct. This is my aunt Caroline she was rarther exentrick and quite old.

So I see said Mr Salteena and he passed on to a lady with a very tight waist and quearly shaped. That is Mary Ann Fudge my grandmother I think said Bernard she was very well known in her day.

Why asked Ethel who was rarther curious by nature.

Well I dont quite know said Bernard but she was and he moved away to the next picture. It was of a man with a fat smiley face and a red ribbon round him and a lot of medals. My great uncle Ambrose Fudge said Bernard carelessly.

He looks a thourough ancester said Ethel kindly.

Well he was said Bernard in a proud tone he was really the Sinister son of Queen Victoria.

Not really cried Ethel in excited tones but what does that mean.

Well I dont quite know said Bernard Clark it puzzles me very much but ancesters do turn quear at times.

Peraps it means god son said Mr Salteena in an inteligent voice.

Well I dont think so said Bernard but I mean to find out.

Like, look at that characterization! Salteena struggling with the gentlemanly skills of forks and finger-bowls, his opinions casually dismissed by his social betters; and Ethel eager to learn more about the world of ancestor-havers and offering supportive opinions of goodwill. It's brilliant!

By the end of chapter 5, Mr Salteena is receiving instructions on how to be more like a gentleman from an actual Earl:

You see these compartments are the haunts of the Aristockracy said the earl and they are kept going by peaple who have got something funny in their family and who want to be less mere if you can comprehend.

Indeed I can said Mr Salteena.

Personally I am a bit parshial to mere people said his Lordship but the point is that we charge a goodly sum for our training here but however if you cant pay you need not join.

I can and will proclaimed Mr Salteena and he placed a £10 note on the desk. His Lordship slipped it in his trouser pocket. It will be £42 before I have done with you he said but you can pay me here and there as convenient.

Oh thankyou cried Mr Salteena.

Not at all said the Earl and now to bissness. While here you will live in compartments in the basement known as Lower Range. You will get many hints from the Groom of the Chambers as to clothes and ettiquett to menials. You will mix with me for grammer and I might take you out hunting or shooting sometimes to give you a few tips. Also I have lots of ladies partys which you will attend occasionally.

Mr Salteenas eyes flashed with excitement. I shall enjoy that he cried.

His Lordship coughed loudly. You may not marry while under instruction he said firmly.

Oh I shall not need to thankyou said Mr Salteena.

You must also decide on a profeshion said his Lordship as your instruction will vary according.

Could I be anything at Buckingham Pallace said Mr Salteena with flashing eyes.

Oh well I dont quite know said the noble earl but you might perhaps gallopp beside the royal baroushe if you care to try.

Oh indeed I should cried Mr Salteena I am very fond of fresh air and royalties.

Well said the earl with a knowing smile I might arrange it with the prince of Wales who I am rarther intimate with.

Not really gasped Mr Salteena.

Dear me yes remarked the earl carelessly and if we decide for you to gallopp by the royal viacle you must be mesured for some plush knickerbockers at once.

Mr Salteena glanced at his rarther fat legs and sighed.

The solution to his pressing lack of knickerbockers, is, of course, something that would seem eminently reasonable to a nine-year-old and that I will not spoil here.

When nine-year-old Daisy Ashford writes her comedy of manners she does so with absolute sincerity and enormous confidence. Her book -- in my opinion -- is an accomplished piece of fiction, where the two storylines are ably intertwined with an arc of rising tension in both, and with admirable characterization and a pretty good grasp of the technicalities of constructing a story.

But it takes an adult to see that the book's understanding of how the world works is only superficial: Ashford can mimic some of the things that the grownups do and say, but the reason why they do and say those things she hasn't quite grasped. Her portrayal of class and jobs and the nobility and dress codes and financial transactions and romantic relationships and so much else besides is constrained by her own sense of logic and by decorum and fairness as she sees them. The discrepancies and the jumps in reasoning are only clear to someone who understands the intricacies that are a little bit beyond nine-year-old Ashford, who simply tells 'em like she sees 'em.

Anyway. No joke, Ashford's little book is one of the most charming things I've ever read. This short story/novella delivers a concentrated dosis of sheer delight, and I won't spoil any more of it. Go read it: it won't take much time at all, and it's the kind of clumsy that's disarmingly endearing in its sincerity. And the proposal scenes are at least as memorable as those in Pride and Prejudice!

405Keeline
Editado: mayo 20, 2022, 2:05 pm

>400 faktorovich:

That site makes a wild guess of the

Est. Annual Revenue
$100K-5.0M

and the

Est. Employees
100-250

For the size of the staff, I'd say it is less than 10. I think the Who We Are page confirms this with the 8 people profiled.

There are 4 developers and let's estimate for this calculation that they are in the $100K range. Other staff may make less. This would suggest a payroll in the $500K to $800K range, perhaps as high as $1M.

What money comes in (now) is from libraries who use the systems built for them. So that is why a lot of the development has been on those projects and far less on the regular site we use.

I am trying a GreasyFork script on TamperMonkey tonight and perhaps I can modify it to work with LT. If I get anywhere, I will let you know. There's a good chance it won't work though so I make no promises.

If it did work one would select the text and click a button to add common tags like blockquote and it would insert the text to the textarea input form. I might only get it to work on the bottom input since it has a static id and the ones that appear with the reply link have an id that is based on the reply. It is harder to make that a variable. I think it can be done but it is a question of whether I can figure it out.

Of course, you won't appreciate the effort anyways. But if I do get it going, it could be fun for me to use.

As far as criticism goes, you set yourself up by making the absolute claims that you don't make mistakes in the series or what you post here. That is not true. A little humility goes a long way but I'm not sensing that in your replies. I continue to stick to the statements you make and whether or not I agree with them. That is how this process works. There are other fora where the outcomes might be less pleasant for all concerned.

James

406faktorovich
mayo 20, 2022, 2:07 pm

>388 Petroglyph: "Reasonably certain is enough. And if there's any doubt, you can test texts to see if they are similar." No, your method depends on a given text to be absolutely known to be by Author X, or when it is compared against an Unknown Text, it cannot either prove or disprove that the Unknown Text is either by the same author or not. If there is doubt about all texts with the Author X byline (as is the case when critics question if there was a real author called "Shakespeare"); then, your test against these doubtful texts cannot come to any conclusion regarding authorship on the Unknown Text. The formula does not allow for testing of all texts in a corpus against each other to determine which fall into similar groups. Instead, it requires that you know in advance which texts have established bylines, to attribute anything else to these bylines. This method does not require any biographical research of the bylines that match each other, as you have pre-determined that the byline of Author X is known and all texts that match it must also be by X, and cannot be by any of the other bylines in the corpus that also match it. This is a quantitative solution that is blind to the possibility of ghostwriting, pseudonyms and other realities of authorship. No computational-linguistic author-attribution method can stop at the numbers without doing the actual "analysis" of checking if there was only a single byline of those in the group who was alive between the time the first and the last texts in this group were written and thus are the only possible author of the entire group, whereas any mere byline on a text cannot be taken to be truer than any other without such minimal biographical research.

If there are studies that compared every single "Shakespeare" text against a few otherwise bylined plays from these same decades; I would very much like to read them closely. Do include a link to them, and I will respond to their findings. I have never seen anything of this sort across all of my research into all available studies on the attribution of the British Renaissance.

Not all of the "Shakespeare" tragedies are different from the comedies; only most of the tragedies were ghostwritten by Percy, and most of the comedies were ghostwritten by Jonson. But there are exceptions in the minority in both of these categories. You can check the details of my comparison of "Shakespeare" texts in the "Shakespeare - Data Table - Structural Elements in Shakespeare" file on my GitHub.

There are several problems testing the word-frequency for so many words will create. If you compare the frequency of 1,000 of the most frequent words in a text, this list will probably include many single-occurrence words in a random order: some of these never occur in any other texts, and some will be relatively common in other texts. And as I have previously said, while the top-6 words form patterns of authorial-style that are easily distinguishable from each other (like a preference to use I or she/he or you); the comparison of all words creates chaotic outputs. For example, there might be a range between 0 occurrences and 500 occurrences for any given word; if the spike is a special name that only occurs in 10 texts about the same character; these texts will be matched as similar just because of this word appearing in so many of them and nowhere else. There are going to be similar glitches or just random frequencies of occurrences for all words. You are still not explaining your rule for how many occurrences of a word or of how many matching words it takes for two texts or a cluster of texts to be marked as similar vs. different. If the human does not know and cannot explain what a computer is doing when it is comparing texts; the resulting conclusions cannot be trusted because the method has not been explained other than to say that the computer "knows" what it is doing because it is artificially intelligent. This is not how scientific findings work. You cannot claim you have invented a method to establish the age of a painting through a special molecular testing method that is doing something the AI invented, and you have no idea what it is actually doing, but you are sure its restatement of known dates-of-creation is correct.

I understood the formula. You have not understood it if you are insisting that it is enough to explain how this formula alone can lead to any attribution conclusion.

"Take the relative frequency of all words. Turn them into Z-scores. Subtract. Divide by n." So, take 8 million words in a given corpus, each rated with its frequency in 284 different texts. So you have 8 million frequencies (or less depending on how you are defining "all words"); then, you are repeating yourself when you state to turn these into Z-scores, since by the average z-score you mean the frequency of a given word in a text. And "n" in your formula means "words". So you are basically describing how you find the frequency of a given word in a text vs. the the total number of words in a text. And you are suggesting that discovery the frequency of only this individual word is sufficient to establish the authorship of the text. The problem with your steps is that the comparative component of comparing even 2 texts to each other is missing in this formula. If you divide "tens of thousands" of words by the "standard deviation" you just get scores of their frequency per-text, not their comparative frequency against all or any of the other texts in the corpus. The audit would find fault in this broad mistake and not in the computer's ability to calculate deviation. What are you calculating "deviation" from? Are you counting a single Text A and its frequency of all words as the standard, and the Unknown Text X as the compared against element; are you only counting the frequencies in these two texts in isolation of the rest of the corpus? And if so, why are you ending up with trees that include dozens of texts? Have you compared all of them only against texts assumed to be by the same authors?

There is no difference between "publishing" a text and making it "available". For the purposes of copyright, posting texts online in-full to make them available for public consumption is equivalent to publishing. As I said, such re-posting is probably legally allowed, but it is unethical because you might benefit from increased traffic on your site for somebody else's editing/ transcription work.

Sure, the raw data from Analyze and other sites is included in this file on GitHub: https://github.com/faktorovich/Attribution/blob/master/Shakespeare%20-%20Data%20...

407faktorovich
mayo 20, 2022, 2:19 pm

>389 Petroglyph: It is indeed easy to create a file of relative frequencies between 2 texts, but it is impossible to reach an attribution conclusion merely on this list because it would be a nonsensical set of frequencies. There will be some words that do not repeat in the other, and many words that repeat at varied frequencies in both, etc. You still have not explained how this mass of numbers is evaluated to reach an attribution decision.

The complete incapacity of your method to test for attribution (as described by the formula you have provided) is entirely different from me not providing the parts of the calculation that go into adding up nouns-percentages. I provide the total number of words, and from this a mathematician can work backwards to divide it by the percent of nouns to determine the precise number of nouns Analyze counted in a given text. Same for exclamation points per 100 sentences. The total number of sentences in the text can be derived with these same programs, and working backwards you can determine from this the precise count of exclamations in a text. In contrast, you are not explaining what steps you are taking between calculating the frequency of each word in a text and reaching the final attribution decision. Obviously just counting the number of each word and comparing its frequency to its use or non-use in one other text is insufficient for any scientifically believable attribution.

408faktorovich
mayo 20, 2022, 2:29 pm

>393 Keeline: I explain Percy's dark humor in his subversively satirical tragedies in his self-attributed "A Forest Tragedy in the Vacuum: Or, Cupid’s Sacrifice" (1602): https://www.librarything.com/work/27242582 or https://www.amazon.com/dp/B09K26HMVD . Nobody has reviewed it on LibraryThing yet, despite dozens of people requesting a copy in the giveaway. This is a farce that is partially satirizing Percy's own "Shakespeare"-bylined "Romeo and Juliet" as well as his other overly violent tragedies like "Hamlet" (as I explain in the annotations); some speeches are extremely similar (if not plagiarized) between "Forest" and these "Shakespeare" texts, except they are exaggerated or taken to absurdity. Perhaps, asking me for a review copy and taking a look will solve this question more fully than pondering the hypothetical? Both "Much Ado" and "Othello" are by Jonson, and represent his unique formulaic preferences.

409Keeline
mayo 20, 2022, 2:39 pm

>371 faktorovich:

The texts I use are cited in my bibliography with their sources, so other users can upload them from the web to re-test them. It would take up too much data and it would be unethical for me to take an EEBO/ Gutenberg book, perform a basic pre-test edit, and then re-publish it on my own website. Even if these sites release public domain rights, it does not feel like the right thing to do to republish their work on my own website (as you have done when posting these texts in your link).

This expresses a fundamental misunderstanding of what "public domain" means. Here are a couple references to consult:

The term “public domain” refers to creative materials that are not protected by intellectual property laws such as copyright, trademark, or patent laws. The public owns these works, not an individual author or artist. Anyone can use a public domain work without obtaining permission, but no one can ever own it.

https://fairuse.stanford.edu/overview/public-domain/welcome/

The public domain consists of all the creative work to which no exclusive intellectual property rights apply. Those rights may have expired,\1\ been forfeited,\2\ expressly waived, or may be inapplicable.\3\

https://en.wikipedia.org/wiki/Public_domain

One of the Stanford projects is a database of U.S. works published between 1923 and 1963 whose copyright was renewed after the initial term of 28 years. If a work is not listed, searching a couple ways, then it was not renewed and it is public domain. It is hard to prove a negative but that is the situation we are presented with with the state of copyright laws and the data available. This database was the work of a Google engineer who gathered all of the copyright renewal data from the books of copyright renewals that Google had scanned. This was standardized and put in a giant XML document. The database (this and some others) make it easier to find data than doing a raw search of a massive text file.

For U.S. copyrights before 1927 (1926 and earlier) they are all public domain. For many copyrights that were not renewed in the 1927-1963 range, they could be public domain if not renewed.

Gutenberg and other book repositories post public domain works because they can. Yes there is work involved to produce and edit them. But the fact that they are public domain means that any limits they try to place on them is more of a courtesy than a legal requirement. Good scholarship says that you note the sources. But removing the legal block of text is completely permissible. You can also remove the preliminary pages, folio and chapters, and get down to the basic text.

If someone wanted to take Pride and Prejudice and intersperse scenes with zombies, they are perfectly allowed to do so. Indeed someone has. Pride, Prejudice and Zombies. It may change the entire character of the original and even if a Jane Austen estate or interest group didn't like it, there is nothing they can do about it legally. This rather old story is completely public domain in all countries.

Where one runs into problems is by taking modern books like the Harry Potter series which they may wish to compare with the pseudonymous work to see if J.K. Rowling wrote it. Sharing those texts is a problem. So those who make an analysis of the texts generally would not produce the stories in case the publishers felt it was competing with their ability to publish millions of copies of each title normally.

Since the U.S. public was paying scant attention to the issue, their representatives in Congress listened to pleas by publishers and media producers to extend copyright terms first in the 1970s and again in the 1990s. The first Hardy Boys books of 1927 would have been public domain in 1984 had it not been for these extensions.

Of course there are separate intellectual property rights such as trademark which may apply if certain rules are followed. Thus, when Steamboat Willie enters the public domain in a couple years, people can copy it freely and reuse it as they wish. They can even make derivative works from it (see the Klinger case about Sherlock Holmes). But Mickey Mouse will still be under trademark protection so long as the Walt Disney Company "vigorously defends" those rights so there's a limit on what people can do with this.

James

410Keeline
mayo 20, 2022, 2:44 pm

>408 faktorovich:

Is this an example of one of your "translations" ?

https://www.amazon.com/Hamlet-British-Renaissance-Re-Attribution-Modernization/d...

From the description you provided and the 2-star review from an Early Readers participant, there seem to be significant changes in this retelling of the story from the play, including new incidents, changed relationships, and different character names.

Is this the kind of text that you compare with others for your authorship attribution?

James

411lilithcat
mayo 20, 2022, 2:55 pm

>400 faktorovich:

I’m saying you don’t know. And neither do I. But I do know what is said about those who make assumptions.

412faktorovich
mayo 20, 2022, 3:03 pm

>399 Petroglyph: Bede and Occo are cited within my quote, so it is strange that you are introducing the fact that they originate with Bede as if you are contradicting what Verstegan (or I) stated in the quote. I explain the basis of this history in the Anglo-Saxon Chronicle and the other ancient sources across my annotations and introductions to the volume. The point I raised is that these are treated as unbelievable legends by Verstegan, but later historians have taken Verstegan's subversively satirical borrowing from these legends of the "Anglo-Saxon" term and are still applying it as a factual historically-proven descriptor of Britons. I am not saying that historians are unaware that "Hengist and Horsa" are legendary, but rather that they are unaware that the term "Anglo-Saxon" is based on this legend as these "Hengist and Horsa" were described as coming from tribes that included "Angles" and "Saxons" (in the region of the Duchy of Saxony in Germany), hence the origin of this term "Anglo-Saxon". Germany controlled or was the heart of the Holy Roman Empire for around 800 years leading up to the Renaissance, so they were motivated to spread propaganda that Britons originated from Germany, and thus Germany was the legitimate owners of Britain (even if they failed to act out this claim). By repeating the term "Anglo-Saxon" modern scholars are reinforcing this propaganda, without realizing it is indirectly confirming the verity of this "Hengist and Horsa" Briton-origin myth.

"Conflicting stories about founding figures does not necessarily mean there is no historical truth at all. But a careless reader and thinker might equate those two." Now you are yelling at Verstegan for being a "careless reader and thinker". And this makes me feel I am in the best possible company.

No, as I explain across "Restitution" as I add specific citations to Verstegan's vague citations: there are many mentions of Britons/Britain pre-Norman, such as Caesar's description of his time there as he conquered England/ landed there and built a fort (which archeologists have dated to his time in Britain). There are a few other mentions that Verstegan cites, and a few others that I have found. "Restitution" is 400 pages, and I am doubling the count with my comments, so there are a lot of explanations that cannot fit into this thread. You really have to actually read my series to what I am talking about.

As I explain in the introduction to the forthcoming BRRAM volume of Harvey's "Virtuous Octavia" play: "Jonson attained an advanced degree and a professorship at Gresham College later in life; so even early on, his ambition for higher-learning is the reason he ghostwrote most of the “Shakespeare”-bylined plays that borrow content from Plutarch’s 'Lives', including not only Antony and Cleopatra, but also Julius Caesar, and Coriolanus; Percy only participated by ghostwriting Timon of Athens" out of the "Shakespeare" plays that borrow their storylines from Plutarch. "Virtuous Octavia" itself is an entirely different retelling of the Cleopatra myth, wherein Cleopatra is the villain who steals Octavia's husband from her, leads the nation into war and then pushes Anthony into suicide by faking her own death. This is a great example how myth are told to sell propaganda that favors a position, and thus they describe the preferences of their authors, and not realities about what happened in the historic past.

Yes, I am interested in the historical truth, and I want history books to state this historical truth, and not to present myths as truth. Who wrote "Shakespeare" or all of the books of the British Renaissance is a broader question that fits into this category of separating actual facts, from what people have come to believe to be true but are actually false-myths.

413faktorovich
mayo 20, 2022, 3:42 pm

>409 Keeline: Something can be in the "public domain" and yet when it is "made available" to the public in a new place by a different party, this re-posting is equivalent to its new publication or re-printing. If the work is in the public domain and the publisher being copied from has released copyrights, this is likely to be legal, but as I said before it is not the ethical thing to do. The right thing to do is to only re-print sections only if one is making significant edits to them (such as modernization) or if one is adding annotations or an editorial introduction. If no changes are made, it is more ethical to just cite Project Gutenberg or EEBO as the source where the texts can be accessed by future researchers. Yes, it is legal to manipulate public domain works in any way (with some exceptions that would be illegal in all contexts). And yes, sharing Potter books would be technically completely illegal as well as unethical.

414faktorovich
mayo 20, 2022, 4:07 pm

>410 Keeline: You are agreeing with this 2-star Amazon review? It says that this reader did not like my book because it had 200+ annotations in it, which they thought was too brainy for them. And it makes several false claims that show this person did not read the book like "archaic terms still present in the text". This is false because I subtracted all of the "archaic terms" from the body of the text, and instead list them in the footnotes, where I define and explain these terms. And the correction - "The word 'toils' should of course be 'toil'" - is incorrect because I correctly matched the singular term "watch" with the singular "toils". In the summary, this reviewer states that a text either has to remain entirely in the original Early Modern English spelling for a scholarly audience, or must be extremely heavily edited to simplify all words and sentences to make them extremely easy to comprehend for a 7th-grade average reader: https://centerforplainlanguage.org/what-is-readability/#:~:text=The%20average%20.... This rule has led previous translations or non-translation re-prints of "Shakespeare" and other Renaissance texts, but it is an incorrect rule. The best strategy for translation of these texts is the precise method I have taken of translating only words that modern dictionaries mark as "archaic" or that are not even in modern dictionaries, while leaving all words that can be found in a modern dictionary as-is, even if they are not familiar to that average 7-grade-level American reader. It is not enough to only provide annotations for the most difficult words, while leaving the rest of the words in their original variant Early Modern English spellings.

As for your comments about this review, they are just a broken-telephone. "There seem to be significant changes in this retelling of the story from the play, including new incidents, changed relationships, and different character names." The reason the names and incidents are different in my translation of "Hamlet" is because it is a translation of the 1st quarto that was published in 1603 before the 2nd quarto was reprinted in the following year, 1604. Most widely-read modernizations, and reprint editions of "Hamlet" are based on a combination or one of the 2nd quarto or the First Folio 1623 (very similar to 2nd quarto) versions of this play. The 1st quarto has been called "bad" because it is very different, and probably largely because it does stress the homosexual relationship between Hamlet and Horatio, which is faded from focus in the edits that were made in the 2nd quarto and the 1st Folio (there are no expressions of interest for Ofelia/Ophelia in the 1st quarto from Hamlet, though Hamlet does appear to have slept with her, while continuing to tell her that he does not love her nor find her to be beautiful and does not want to marry her; in contrast, Horatio offers to kill himself when he sees that Hamlet is about to die, and Hamlet responds with equal affection, though he insists Horatio does not kill himself because of what this suicide would suggest). Thus, this "bad" 1st quarto has never been translated into Modern English before, and this is why the events and names in it sound so unfamiliar to readers. If it had been translated before I would not have included it in the series that only includes Inaccessible British texts. I did not rename characters from "Ophelia" to "Ofelia", nor did I insert a homosexual relationship between Hamlet and Horatio; I just reused the names as they were in the 1st quarto, and presented the events and dialogue as-is (cleaned up for the first time, so that they can be understood for what they were always saying by modern readers who could not understand the previously available old-spelling version).

No, I did not first edit the 1st quarto of Hamlet before testing it, and in fact, I tested the 2nd quarto in old-spelling and the 1st Folio version from a mainstream publisher in modernized-spelling. If you are unsure about what my series or books in it are about, maybe you should frame your statements as questions? Otherwise you are playing a broken-telephone with a review from somebody who has rushed through a text without reading enough to understand what quarto it was, or why it is so different etc.

415thorold
mayo 20, 2022, 5:25 pm

>412 faktorovich: “Jonson attained an advanced degree and a professorship at Gresham College later in life; so even early on, his ambition for higher-learning is the reason he ghostwrote”

That seems a bit misleading: it caught my eyes because I had never heard of anyone getting a degree from Gresham’s. Presumably you are talking about his honorary MA from Oxford, which isn’t really an “advanced degree” in the modern sense at all, and anyway seems to imply recognition of life achievements rather than aspirations to higher learning.

Judging by a quick Google search, the “professorship” is far from certain: there’s no record of any such appointment, but some people have conjectured that he might have been deputising for someone else when he gave Gresham’s as his address in a 1623 law case. (See e.g. here: https://www.gresham.ac.uk/sites/default/files/06feb95andrewgurr_benjohnson.pdf)

416Petroglyph
Editado: mayo 20, 2022, 6:01 pm

>412 faktorovich:
Germany controlled or was the heart of the Holy Roman Empire for around 800 years leading up to the Renaissance, so they were motivated to spread propaganda that Britons originated from Germany, and thus Germany was the legitimate owners of Britain (even if they failed to act out this claim). By repeating the term "Anglo-Saxon" modern scholars are reinforcing this propaganda, without realizing it is indirectly confirming the verity of this "Hengist and Horsa" Briton-origin myth.

What the fuck.

Are you denying that Germanic-speaking peoples migrated to Britain after the Roman retreat?

Why is everything a conspiracy theory?

Germany controlled or was the heart of the Holy Roman Empire for around 800 years leading up to the Renaissance

What the fuck? The Holy Roman Empire is generally taken to officially start around 960CE, and ended during the Napoleonic wars (1806? IIRC), when Napoleon dissolved it and dismissed the last Emperor. Depending on when you locate the Renaissance in northern Europe (Germany, low countries, Britain, Scandinavia) -- generally around 1500 -- the Empire was around for some five, six hundred years. At the most.

Germany controlled literally zero, because Germany did not exist. The Empire was a conglomerate of hundreds of little principalities.

tribes that included "Angles" and "Saxons" (in the region of the Duchy of Saxony in Germany)

The Saxons (and Angles, and Jutes and others) who migrated into Britain indeed came from what is now northern Germany and Jutland in present-day Denmark. But that happened in the 300s-400s.

The Duchy of Saxony, as a Duchy making up a segment of the Empire, was instituted in the twelfth century. The Duchy is separated from those fourth-century Saxons by forced christianization, a few near-genocidal campaigns by Charlemagne ("the Saxon wars"), the Carolingian empire, the breakup of the Carolingian empire and whatever other changes some seven or eight hundred years' worth of history bring. The Holy Roman Empire was separated from the pagan Angles, Jutes and Saxons by about six hundred years.

The Germanic-speaking Saxons who migrated to Britain in the fourth century may have occupied the same land that some eight hundred years later would be called the Duchy of Saxony, but those were two very different societies! Fourth-century Saxons were most definitely not from the Duchy of Saxony.

Holy shitballs, lady. There's ahistorical thinking, and then there's this.

By repeating the term "Anglo-Saxon" modern scholars are reinforcing this propaganda

Only in your way of thinking. Other people don't think like this.

As I explain in the introduction to the forthcoming BRRAM volume

Your kooky poppycock is worthless as an explanation.

a broader question that fits into this category of separating actual facts, from what people have come to believe to be true but are actually false-myths

Whatever makes you feel important and scholarly, I guess.

Fractally wrong. At every level. It's astonishing.

417faktorovich
mayo 20, 2022, 7:56 pm

>415 thorold: The sentence explains separately that 2 things happened 1. Jonson received an advanced degree, 2. Jonson received a professorship at Gresham College. He did not receive these 2 from the same place. As I explain in the forthcoming volume of my translation of Jonson's "Cavendish"-bylined "Variety" comedy, Jonson's handwriting matches "John Donne's" and "Donne" appeared in the rolls at Oxford when "Donne" was only 11 in 1583. Ben Jonson was also born in 1572, or was the same age. Then, "Donne’s" name shows up in Cambridge, but he does not receive a degree from either because of his Catholicism, and yet he manages to be accepted as a lawyer into Lincoln’s Inn by 1592. Given many other pieces of evidence I discuss in the extensive introduction, it is likely that Jonson used the "Donne" pseudonym during his actual college education before he was granted the honorary MA under his own name, so that he could receive a professorship at Gresham as himself, and without being weighed down by the Catholicism exceptions that blocked "Donne" from becoming qualified for a graduate degree to teach college. It is indeed also possible that he was teaching at Gresham under yet another pseudonym, but if he gave his address as being there this is the most certain proof possible that Jonson was indeed residing at Gresham and working there in 1623 and surrounding years.

418Petroglyph
mayo 20, 2022, 8:20 pm

In case anyone is wondering what reading Faktorovich's book is like, look no further than >417 faktorovich:

Paranoid historical fiction where literally anything is twisted to fit her kooky poppycock. There's precious little in the way of physical evidence: just her spinning her stories.

419faktorovich
mayo 20, 2022, 8:29 pm

>416 Petroglyph: "The Roman retreat"? The Holy Roman Empire = (approximately) Germany for 800 years, starting with Emperor Otto I in 962. These 800 years were when most of the Old English = Old German (and later Middle and Early Modern English) texts were written in Britain. The Angles and the Saxons are claimed to have migrated into Britain in around 477 or a few hundred years before Germany seized control of the Catholic Church/ Holy Roman Empire. This was a period at the end of the Western Roman Empire, and before this region was taken over by German kings in the Holy Roman Empire. Bede’s "Ecclesiastical History" is claimed to have been written in 731. The original "Anglo-Saxon Chronicle" is of an uncertain date of authorship, but edits of it are dated as having been expanded by two different scribes after the burning of the monastery of Peterborough in 1116, the first scribes adding entries “from 1122” until 1131, while the second scribe added entries “in 1154 or shortly thereafter.” The latter adjustments were made at the peak of the German Holy Roman Empire, while the Bede manuscript could have been back-dated to a preceding century, since it was only available in the archives and was not officially registered for publications in the years before print. Thus, most of these "Anglo-Saxon" myth for Britons' origins could have been created by Germany Holy Roman Empire scribal propagandists. Everything is not a conspiracy, but there is absolutely a conspiracy behind this German-origins for Britons myth.

Verstegan used the "800 years" figure, so I repeated it, as he might have had a reason to believe Germans had influence over the Empire in the centuries before they gained full control of it.

The last Holy Roman Emperors were still kings of Germany when it was dissolved in 1806. Germany was the Holy Roman Empire, as the region that was the Holy Roman Empire on the map changed its name to Germany (with some geographic adjustments).

There are also many principalities in Britain, France, etc. This has nothing to do with Holy Roman Empire being equivalent (approximately) to Germany.

If the Angles, Saxons and Jutes migrated to Britain in precisely 477 is exactly the myth at hand. There was constant migration among the different peoples of the European region, as well as between Eurasia (Russia) and further into the Middle East and regions across Europe. There were no clear borders across most of Europe across the years early centuries, and there was nobody stopping migrants (with some exceptions in mapped regions); archeologists and DNA researchers have derived maps of migration over the millennia that explains the flow of Indo-European languages. It is as ridiculous that all Britons came from the specific region of the Duchy of Saxony in Germany (where these Angles, Saxons and Jutes are mythologized to have lived), as it would be to conclude that all Native Americans came from Spain because Columbus landed there in 1492. Yes, precisely the "Duchy of Saxony" formed centuries after the 477 date, and yet the myth clearly glorifies this Duchy. Terming Columbus' visit as a "founding" of the "New World" was as much propaganda as claiming Britain belongs under German-Holy Roman Empire control because it was "founded" by Germans from Saxony.

420faktorovich
mayo 20, 2022, 8:31 pm

>418 Petroglyph: A book cannot be read in isolated segments. You have to read a full book to understand it. I am not afraid of history. You are afraid of my interpretation of history based on evidence that contradict the mythological history that you have come to believe to be truthful.

421prosfilaes
mayo 20, 2022, 9:18 pm

>413 faktorovich: If no changes are made, it is more ethical to just cite Project Gutenberg or EEBO as the source where the texts can be accessed by future researchers.

I can't speak for EEBO, but we who have labored for Project Gutenberg have generally been of the understanding that we are helping make these works available to the public, for whatever use the public wants to make of them. It is not unethical to copy a PG text to where ever you like; that's what they were made for.

422faktorovich
mayo 20, 2022, 10:51 pm

>421 prosfilaes: Sites like Gutenberg and EEBO receive funding based on the number of hits on their website, or if it is used by the public. By diverting traffic away from these sites and onto your own website, you are contributing to deflating their ability to make more works accessible to the public in the future. And this is unethical.

423Petroglyph
mayo 20, 2022, 11:37 pm

Oh dear. There's yet more levels of fractals to be wrong about.

I've had my own history americaplained to me before, and I've read my share of niche conspiracy theorists. But erm, "the Bede manuscript could have been back-dated to a preceding century" is a new one for me.

Faktorovich, do I wait until you bring up the Knights Templar on your own or would it be terribly rude of me to ask for a shortcut to that particular cul-de-sac?

Old English = Old German (and later Middle and Early Modern English

I pity your students so, so much. That is, if it's true that you've taught.

all Native Americans came from Spain because Columbus landed there in 1492

Columbus was Genoese. His financial backers happened to be the King and Queen of Spain. But Columbus himself was Genoese.

(hint: that's on a different Mediterranean peninsula than Iberia.)

archeologists and DNA researchers have derived maps of migration over the millennia that explains the flow of Indo-European languages

Pots are not people. Rookie mistake. Big mistake. Huge!

There are also many principalities in Britain, France, etc. This has nothing to do with Holy Roman Empire being equivalent (approximately) to Germany

*snrk* Ahahahahaha!

The last Holy Roman Emperors were still kings of Germany when it was dissolved in 1806. Germany was the Holy Roman Empire, as the region that was the Holy Roman Empire on the map changed its name to Germany (with some geographic adjustments).

A kindergartner couldn't have put it better. Though they might be better writers than "doctor" Faktorovich.

The latter adjustments were made at the peak of the German Holy Roman Empire, while the Bede manuscript could have been back-dated to a preceding century, since it was only available in the archives and was not officially registered for publications in the years before print.

Ah yes. That is an excellent example of conspiratorial-style thinking and hyper-motivated reasoning. Just like She is arrived and we were come are "Percy's linguistic quirk". Perhaps Old English, Chaucer, German and Dutch are merely honouring Percy every time they say things like sie war bereits gegangen instead of *sie hatte bereits gegangen and At night was come in-to that hostelrye Wel nyne and twenty in a companye. This goes all the way back to Saxon times! IT'S ALL CONNECTED

You are afraid of my interpretation of history based on evidence that contradict the mythological history that you have come to believe to be truthful

And we're back to the standard-issue conspiracy nut script. Go on. Call me a brainwashed sheep, a sheep in a herd of unthinking mainstreamers. Or a shill! You haven't pulled that one in a while.

424Petroglyph
Editado: mayo 20, 2022, 11:40 pm

>420 faktorovich:

A book cannot be read in isolated segments. You have to read a full book to understand it

How many dictionaries have you read front to back? Encyclopedias?

Books can be read exactly how readers want to read them. You may personally prefer for your self-published works to be read a certain way. But you're not in a position to make me. Or anyone else.

Anyway. There's no denying that your post 417 and your book were written by the same person.

425Keeline
Editado: mayo 21, 2022, 1:39 am

For what it is worth, the Wikipedia entry on Ben Jonson says:

On returning to England, he was awarded an honorary Master of Arts degree from Oxford University.

https://en.wikipedia.org/wiki/Ben_Jonson

James

426andyl
mayo 21, 2022, 4:56 am

>422 faktorovich:

You are factually wrong about PG. PG does receive some financial donations from individuals. However it works through volunteer effort - from scanning to proof-reading the OCRed scans all the way through to putting the finished texts on the PG website(s). It certainly doesn't receive funding based on number of hits.

From the mission statement

Project Gutenberg is not powered by financial or political power.

I am absolutely positive that PG would not think that having the texts of various books on a github site along with data-analysis software and results would be in any way unethical. They just don't care about that.

In fact they say

Project Gutenberg welcomes sites to mirror (copy) our collection.

There you have it. They are welcoming people to put up copies of their entire collection (or a subset if they wish) of books - thus removing loads of hits from their own servers.

427faktorovich
mayo 21, 2022, 1:18 pm

>423 Petroglyph: As I stated earlier, Bede and other early manuscripts need to be carbon-dated to determine the century in which they were created. The fact that nearly all of these history-defining manuscripts have never been tested and their dates remain with question-marks is a sign that the collections holding them have doubts about their current dating as well, and do not want to devalue their possessions, or stir a potential historical scholarship scandal.

Old English was pretty much the same language as Old German at a point before these languages diverged, or as English progressed into Middle and Early Modern English. Whatever mistake you are seeing is one that exists only in your imagination.

My analogy is not about the nationality of Hengist and Horsa, but rather about the nation from which they journeyed, Germany, into Britain before, as the myth suggests "discovering" Britain and then encouraging others from Germany to settle in Britain. It would be strange if the legend of Hengist and Horsa involved identical circumstances to those of Columbus and his "discovery" of the Americas. I had imagined that you would be able to distinguish the analogy from the history.

Pots? DNA analysis cannot be conducted on pots. It was conducted on people. Here is one of the articles that explains the science: https://www.amphilsoc.org/sites/default/files/2018-08/attachments/Reich.pdf

I covered the topic of the Northumberland archives (Percys' collection) and forgery in my latest review of: Don Ringe, "A Historical Morphology of English" (Edinburgh: Edinburgh University Press, 2021). It mentions that “glosses written in the Northumbrian dialect between 950 and 1000” were “the only substantial northern texts that survive from between the eighth century and c. 1300” registered “significant changes” in contrast with pre-8th century texts. This is relevant to my research because the Northumbrian estate belonged to the Percys, or William Percy’s family; in my translation of Jonson and Percy’s co-written "Variety" (1649), I pointed to several ancient documents from Northumberland’s collection that Percy plagiarized from; I have also matched Percy’s hand to interludes such as the “Ulpian Fulwell”-assigned "Like Will to Like" (1568) that Percy appears to have written in around 1587, but backdated it to an older time perhaps to increase its value. The presence of oddly-spelled Old English manuscripts in the Northumberland archives for a period for which there are no other texts in Old English across Britain hints that Percy or others in the Workshop might have forged these documents to appear ancient. Evidence that some of these documents might also be authentic, but that they were written much later than assumed includes the residence in Northumberland of the Mount Carmel (Carmelites), a hermit religious brotherhood that was brought into Britain in 1242 by crusaders who were returning from the Holy Land; these brothers first settled in Alnwick, Northumberland before moving elsewhere.

You are screaming about "sheep" and that you are afraid of a "conspiracy nut script". I am calmly presenting the evidence in support of my claims.

428faktorovich
Editado: mayo 21, 2022, 1:39 pm

>424 Petroglyph: I have read at least one dictionary cover-to-cover while studying for the SAT. And I am now translating a whole dictionary + history textbook via my translation of "Restitution".

429faktorovich
mayo 21, 2022, 1:37 pm

>425 Keeline: The story about Jonson's on-foot journey to Scotland is a ridiculous anecdote that was concocted by "Drummond" in "Heads of a Conversation betwixt the Famous Poet Ben Jonson, and William Drummond of Hawthornden, January 1619" (1711). The publication of this work 80 years after Jonson's death has led critics such as C. L. Stainer to describe it as a blatant "forgery" in "Jonson and Drummond: Their Conversation - A few Remarks on an 18th Century Forgery". Yes some critics such as Julie Sanders in "Ben Jonson in Context" have called this claim of forgery one of the "conspiracy theories", but responding to evidence of forgery by calling it a "conspiracy" is stronger evidence that a conspiracy is afoot, if no actual evidence is presented to counter the evidence that supports the likelihood of it indeed being a forgery. You have to look up the Oxford rolls to determine precisely how the MA was described when it was granted, as Jonson was frequently in jail over breaking sedition/libel rules in texts he wrote, so an "honorary" degree would have been revealing a hidden positive relationship between him and the establishment.

430andyl
mayo 21, 2022, 2:02 pm

>427 faktorovich: Old English was pretty much the same language as Old German at a point before these languages diverged, or as English progressed into Middle and Early Modern English. Whatever mistake you are seeing is one that exists only in your imagination.

Yes in about 490CE it was the same language - because it was brought here by Angles, Saxons and Jutes.

It started diverging from Old German in the late Anglo-Saxon period due to influence from Old (East) Norse.

431faktorovich
mayo 21, 2022, 8:40 pm

>430 andyl: I researched and reached a conclusion about the linguistic-origin earlier today, but since you are restating the Hengist-Horsa myth as if you have not read anything I previously stated on this point, I am not going to quote my findings here. They will be in the forthcoming BRRAM volume for "Restitution".

432andyl
mayo 22, 2022, 5:18 am

>431 faktorovich:

So what is the alternative? The population of most of England (except for Cumbria and Wales) decided to stop speaking Common Brythonic and start speaking Old English spontaneously? And that they all did it so as to develop slightly separate dialects? That they also decided to adopt Germanic legends and belief structures?

I am not suggesting that the Germanic peoples replaced the British people wholesale (although there were probably some migrations of Brythonic speakng peoples within England at the time), just that there were enough of them to assume positions of power and status such that the culture and language changed over time.

As for restating the Hengist-Horsa myth - I am not. There can be migration of Germanic people to England (who carried their language and belief structures) without any mythic stories (such as Hengist-Horsa) being true. In fact divine (or semi-divine) twin brothers forming a civilisation is a mytheme that goes back to prehistory. It is attested in a number of civilisations throughout Europe.

433Petroglyph
Editado: mayo 22, 2022, 8:52 am

>427 faktorovich:

Bede and other early manuscripts need to be carbon-dated to determine the century in which they were created

Leaving aside the fact that C-14 dating isn't really reliable for more recent periods (until +/- 500-600 years ago is the earliest); leaving aside the fact that you'd be testing the material the text was written on, and not the actual text; leaving aside the fact that autographs are very very rare; leaving aside internal and external methods of dating; and so much more besides.

Your paranoia and fundamental distrust of fields you don't understand is your issue to deal with. The world doesn't owe you a coddling.

Old English was pretty much the same language as Old German at a point before these languages diverged

Sigh. It's you using idiosyncratic interpretations of well-established terminology again.

The precursor dialects to the languages that we later would call English and German were dialects in the same West Germanic continuum at some point, yes. That is trivially true.

"Old German" is a term used by linguists to designate the form of the present-day language called German back when it became a recognizably entity separate from other languages. That is traditionally dated to around the 8thC, when a few sound changes undergone by old High German and Old Middle German had completed (Apple - Apfel). For English, that stage is called "Old English" about the 6th-7thC. The dialects that were taken into Britain were Old Low German dialects -- closer to the dialects that would later be called Dutch and Frisian and Plattdütsch.

So no. If you use the terms correctly, your statement is flat-out wrong. If you use them inappropriately, to vaguely refer to whatever ideas you have about this well-established set of terms (like people who call Shakespearean English "Old English"), you really should let people know.

My analogy is not about the nationality of Hengist and Horsa, but rather about the nation from which they journeyed, Germany, into Britain before

Same comment here: Germany did not exist back in the 6thC. There was no such nation then. You have this scenario in your head, but you use inappropriate terminology and concepts to convey it.

It's a feature of your writing in general, I must say.

Pots? DNA analysis cannot be conducted on pots

"Pots are not people" is an aphorism in archaeology that goes back to at least the sixties. It means that you cannot reliably trace the spread of prehistoric languages through archaeological artefacts alone. When one style of pottery spreads into a new archaeological horizon (say, the Corded Ware pottery into the Danube area and further into Europe), you just cannot know (based on only this kind of data) whether that means that the people living in central and western Europe at the time merely adopted those pots, or whether speakers of Indo-European settled new areas. You can, archaeologically speaking, trace the movements of physical objects and techniques and ways of producing and decorating items. But those movements do not necessarily imply immigration of speakers. "Pots are not people" is a reminder to stick to what the data tells you, and not to over-interpret what it implies. Tracing the movement of prehistoric languages is a hard problem.

That article you linked talks about how recent advances in securing and sequencing prehistoric human DNA has introduced solutions to this problem: From the first page of that article you linked: For the first time, we can trace movements of people and ask whether transformations in material culture in the past correspond to movements of people or communication of ideas.

Northumberland {...} Percy {...} plagiarized{...} backdated {...} forged

Paranoid conspiracy dreck.

You are screaming about "sheep" and that you are afraid of a "conspiracy nut script". I am calmly presenting the evidence in support of my claims.

And there we have it: that final note that makes you appear reasonable, along with a soupçon of "afraid of the truth". All part of the conspiracy nut script.

434Petroglyph
mayo 22, 2022, 8:49 am

>428 faktorovich:

I have read at least one dictionary cover-to-cover

Sure you have. It's irrelevant. The point was that you said this: A book cannot be read in isolated segments. You have to read a full book to understand it

There exist plenty of books that are not meant to be read in full. Starting at the first letter on the first page and ending with the final one is, of course, possible. But that is not the intent behind such books, and virtually no-one treats them like that.

Plenty of books are designed to be consulted in isolated segments. It'd be nice if you could acknowledge that.

435Petroglyph
mayo 22, 2022, 10:20 am

>406 faktorovich:

your method depends on a given text to be absolutely known to be by Author X

No, it doesn't. I said "reasonably certain". Stop putting words in my mouth. Your expectations of absolute certainty in the humanities are ridiculous. An impossible standard that allows you to always claim your expectations are unmet. It's a very lazy stick to beat a dead horse with.

it cannot either prove or disprove that the Unknown Text is either by the same author or not

It can give strong evidence for or against. The preponderance of evidence arising from multiple tests, together with external evidence is usually convincing to reasonable people.

And neither can your method.

when critics question if there was a real author called "Shakespeare"

That is a fringe conspiracy. It is not considered creditable by professionals, and the various fields that anti-stratfordians imagine they are up-ending are not set up to cater to those fake criticisms.

it requires that you know in advance which texts have established bylines, to attribute anything else to these bylines {...} as you have pre-determined that the byline of Author X is known and all texts that match it must also be by X

See, careful selection of meaningful parameters will reveal how coherent a putative body of texts is -- how likely it is that a number of texts ascribed to one author were actually composed by that person. That's part of what the tests do. If there are deviating texts, the tests will show that.

You're assuming that professionals work like you: assume the conclusion. That's not how this works.

This is a quantitative solution that is blind to the possibility of ghostwriting, pseudonyms and other realities of authorship

Your paranoia makes these things seem much more prevalent than they are. Especially in periods of time where things like copyright and originality had very different interpretations and valuations from today.

If there are studies that compared every single "Shakespeare" text against a few otherwise bylined plays from these same decades; I would very much like to read them closely. Do include a link to them, and I will respond to their findings. I have never seen anything of this sort across all of my research into all available studies on the attribution of the British Renaissance.

Thing is, "let's take the anti-stratfordians seriously and test the Shakespearean canon to their satisfaction" isn't really something the professionals do. Pseudoscientific poppycock like yours is not the guiding principle of the kinds of work professionals do. But sure.

Burrows, John. 2012. ‘A Second Opinion on “Shakespeare and Authorship Studies in the Twenty-First Century”’. Shakespeare Quarterly 63 (3): 355–92. Especially pp. 387 ff. He illustrates this with Romeo and Juliet. That section is mainly interested in collaborative plays of Billy Shakes with other people, so the discussion focuses on that, and compares his plays to other playwrights. But the other texts in the Shakespearean corpus do not stray further than half a standard deviation from the others.

If you want to see his graphs for the other plays, feel free to contact him.

Craig & Kinney, in Shakespeare, Computers, and the Mystery of Authorship pp. 20 ff perform one such test as an illustration (comparing the Shakespearean corpus with contemporary playwrights; they treat Coriolanus as an anonymous text to see whose cloud it will fall in). The other Shakespeare plays cluster together. Their chapter 8, which considers whether Shakespeare could have written the 1602 additions to Kyd's Spanish Tragedy, does a series of such analyses for Shakespeare vs Jonson, vs Dekker, vs Webster. The rest of the book contains similar texts for other collaborations. The non-collaborative Shakespearean corpus is fairly coherent in those tests.

Elliott and Greatley-Hirsch: "Arden of Faversham, Shakespearean Authorship, and 'The Print of Many'" (Chapter 9 in the New Oxford Shakespeare: Authorship companion) Especially pp. 159 ff, where they perform a series of tests comparing the Shakespearean corpus to contemporary playwrights.

Like, the Craig & Kinney book is kinda famous in these circles.

Of course, given your track record, you're just going to reject these out of hand or claim you have already disproven such methods. Or you're setting deliberately impossible standards. When you say "I have never seen anything of this sort", you are leaving enough unsaid that you can always find the space to claim "well, that was not really what I was asking for, so I'm gonna discount it".

I shouldn't have bothered, really. Too late now.

There are several problems testing the word-frequency for so many words will create.

See? You're rejecting not-your-method already!

436Petroglyph
Editado: mayo 22, 2022, 2:19 pm

>406 faktorovich:

There are several problems testing the word-frequency for so many words will create.

"Problems" that you imagine based on your misunderstanding of this method. Deliberate or otherwise.

If you compare the frequency of 1,000 of the most frequent words in a text, this list will probably include many single-occurrence words in a random order

... So are the words compared the most frequent ones? Or do they occur only a single time? What's the length of the text you're imagining?

This is a non-existent problem born out of confusion.

the top-6 words form patterns of authorial-style that are easily distinguishable from each other

Apophenic patterns.

For example, there might be a range between 0 occurrences and 500 occurrences for any given word; if the spike is a special name that only occurs in 10 texts about the same character; these texts will be matched as similar just because of this word appearing in so many of them and nowhere else.

No, this is not true. Character names are, obviously, not counted towards similarity scores. If you'd truly understood these methods that you claim to have criticized and debunked, you'd know this.

But I can see how character names would be a problem for your paste-the-entire-text method that is incapable of advanced corpus preparation. If you were to exclude all character names, you'd have to do so manually. If a character name turns up in the list of "shared words" between texts, then you have no way of excluding that word in a way that is not prohibitively labour-intensive.

But that's a shortcoming in your method. Computer-assisted methods have come a long way.

You are still not explaining your rule for how many occurrences of a word or of how many matching words it takes for two texts or a cluster of texts to be marked as similar vs. different.

The rule is: more similar to each other than to other texts in the same corpus, as calculated over corpus-wide metrics for whatever test you're using. MFW, n-grams, series of punctuation marks, function words, subsets of function words or content words, geographical markers (colour vs color), sentence length, typical words/idioms/trigrams/..., words/idioms/trigrams/... usually avoided by this author... To a statistical degree of significance. Instead of working with a pre-determined set of criteria to separate two (or more) authors, or to assign a text to one out of a group of authors, the relevant criteria are usually selected on the particular corpus you're working with: how do these groups of texts in particular differ, and how does the mystery text fare when tested against those relevant criteria.

If the human does not know and cannot explain what a computer is doing

Given your shoddy understanding of computers (and text encoding, and opening .txt files with a spreadsheet, and citing html use as part of your programming cred), your standards for "knowing what a computer is doing" are clearly irrelevant.

That formula you struggle so mightily with explains the steps that the computer takes.

Your lack of understanding does not mean that something is nonsensical. Other people have mental lives, too. They are different from yours.

I understood the formula.

You clearly don't. If you did, you wouldn't ask the questions you've been repeating.

you are repeating yourself when you state to turn these into Z-scores, since by the average z-score you mean the frequency of a given word in a text {...} If you divide "tens of thousands" of words by the "standard deviation" you just get scores of their frequency per-text, not their comparative frequency against all or any of the other texts in the corpus.

You don't know what a Z-score is. If you did, you wouldn't imagine these confused non-problems.

Take the relative frequency of a word in one text (say, the), subtract the mean (the average of the frequency of the across all texts in your corpus), and divide the result of that by the standard deviation (also calculated over the entire corpus). that is the Z-score of the in that text.

Now you have a measure of how "typical" or "deviant" that text's use of the is, as measured against your corpus. You do this for all the top 100 or 500 or 2000 or whatever words in your corpus, in every single one of the texts. Now you can calculate how similar or how different these texts are to each other.

You can do the same thing for character and word n-grams. The computer doesn't care what it calculates the Z-score of.

And you are suggesting that discovery the frequency of only this individual word is sufficient to establish the authorship of the text.

No, of course not. What an ignorant thing to say.

The software performs the same operations on every single word of whatever range you've specified (top 150, top 500, top 2000). The formula ranges from i to n. From the first word to the n-th. That's what those figures below and above that big Sigma indicate.

It is you who writes books claiming that the absolute frequencies of the top six words form "PAtTErnS OF aUThorIaL-STyle ThAt arE EAsIlY dIstINguIshAbLe FroM EaCH otHeR". Top six. That's your thing. Nobody else's. And it's garbage.

Your shortcomings are not shared by everyone. Other people have mental lives, too. They are different from you.

I understood the formula.

You clearly don't. You say you do, but you don't. It's obvious.

The problem with your steps is that the comparative component of comparing even 2 texts to each other is missing in this formula

... no? Instead of the authorial bundle k (which has the averages of all the words in all the texts assigned to that author), you just... use a second text?

You apply the formula to every possible pairing of texts in your corpus. Of course. Why wouldn't you?

What are you calculating "deviation" from? Are you counting a single Text A and its frequency of all words as the standard

Deviation from the scores as normalized against the corpus. If you have to ask this, you haven't understood the formula.

And again, what you are describing here is your rubbish method. You mark ~18% of your entire corpus as "similar" to a single text. You then repeat that for every single text in your corpus.

why are you ending up with trees that include dozens of texts? Have you compared all of them only against texts assumed to be by the same authors?

No. These texts have all been compared against the corpus-wide standard. The software compares any of these dozens of texts against all the others. And since they all have been scored in terms of how similar or deviant they are from the corpus-wide standard, it's easy to compare all these texts to each other: the basis on which you do so is constant across all the texts.

If you understood these methods, you wouldn't ask the ignorant questions you've been repeating. The fact that you even ask them is evidence of your non-understanding.

Or you're deliberately smearing not-your-method with nonsense, in hopes of sounding like a critical scholar.

Probably both. A lot of both.

Faktorovich, you do not understand the methods that the professionals use. You fail to grasp how computationally-assisted methods work. Your misunderstandings lead you to imagine problems that do not exist in reality, only in the misunderstood fiction you have concocted in your head. You then "criticize" people and methods based on non-existent issues with a fictional method.

This is why no-one takes you seriously.

437faktorovich
mayo 22, 2022, 11:50 am

>432 andyl: Well, simply-put, this article - https://www.cambridge.org/core/journals/antiquity/article/div-classtitlethe-peop... - indicates that it is very likely German-Dutch people had a mass-migration into Britain at around the time Otto I became the first German-king Emperor of the Holy Roman Empire in 962. The Germans brought their Old German, which is thus identical to earliest form of Old English, and this language then changed into Middle and Early Modern English between that point and the Renaissance. Archeological evidence indicate that some Romans stayed in Britain after Caesar built a first Roman fort in around 43 AD. This explains why there were Greco-Roman pagan religions worshipped in the British Isles (though some of the names appear to have been altered and some myth were altered through a long separation from the continent and Roman culture). And there were some ancient hunter-gatherers that had been left in Britain when Doggerland drowned 6000 years ago; but there were not enough of them for them to register as a separate group on DNA tests. The Germans clearly wanted to establish an ancient history of their ownership of the British Isles when the threat of a Norman invasion came up (and before they actually invaded and confiscated the land into their feudal system starting in 1066, or only 3 generations after the Germans had settled in), so they probably publicized the Hengist and Horsa myth at that time that claimed they had arrived back in 477.

438faktorovich
mayo 22, 2022, 12:02 pm

>433 Petroglyph: Otto I was King of Germany and Holy Roman Emperor, so at the time when I am saying Old German was brought into Britain, there was definitely a place called Germany that was also the Holy Roman Empire. I have not been making a distinction between High and Low German because this distinction is relevant when discussing regions in Germany, and not the use of Old German in Britain.

439faktorovich
mayo 22, 2022, 12:07 pm

>434 Petroglyph: The reviewer of my "Hamlet" translation failed to grasp the subtitle "First Quarto" was significant, and failed to read the "Introduction" where this detail was explained before reaching the false conclusion that I (and not "Shakespeare"/Percy) had alternative name-spellings and a homosexual plot in this play. If a reviewer is making false statements across all parts of a review because he or she has failed to read anything but the isolated fragments they are quoting from (and even those they have not read closely enough to grasp the grammatical details); then, they are not doing their job as a reviewer, but rather issuing a hateful tirade of nonsense. Readers can read or not read whatever they want, but reviewers who write about what they read have a responsibility to at least read what they are going to be commenting on.

440andyl
mayo 22, 2022, 1:06 pm

>438 faktorovich: Otto I was King of Germany and Holy Roman Emperor, so at the time when I am saying Old German was brought into Britain, there was definitely a place called Germany that was also the Holy Roman Empire.

Otto I was the Holy Roman Emperor from 962. He was King of East Francia and Duke of Saxony from 936. The Holy Roman Empire was larger than East Francia. East Francia included more than modern Germany as well.

But you are saying that before those dates the only languages in England were Common Brythonic and after about 800CE Old Norse (I guess). This is just nuts.

441Petroglyph
mayo 22, 2022, 1:25 pm

>440 andyl:

This is just nuts

Off the deep end. No wonder she wants to carbon-date manuscripts that don't fit with her revisionist garbage.

442faktorovich
mayo 22, 2022, 2:27 pm

>435 Petroglyph: "Careful selection of meaningful parameters" (word-frequency for 1 or 2 texts) "will reveal how coherent a putative" (assumed) "body of texts is -- how likely it is that a number of texts ascribed to one author were actually composed by that person." (You are not testing if "known" texts "by one author" "were actually composed by that person": you are just assuming this to be a fact. This is the problem with your method, which you are working to avoid acknowledging.) "That's part of what the tests do." (No, your tests depend on some texts being of known authorship, or there is nothing to compare a mystery text against to check if it is similar or different.) "If there are deviating texts, the tests will show that." (No, your test would only show that a text is not like the "known" text, and not that the "known" text has been assigned the wrong byline.) "You're assuming that professionals work like you: assume the conclusion." (You are assuming the conclusion when you restate current bylines in your diagrams, without showing what steps in data processing led to this miraculous re-affirmation of a "conclusion" you had "assumed" or expected to be true.)

I mention another one of Burrows' essays in Volumes 1-2, but not this 2012 Burrows article you mention, "A Second Opinion". I reviewed it just now to check what you are referring to. I started noticing attribution problems earlier in the article, but turned to page 387, so that we are looking at the same points. This page supports my argument more than it does yours. It is referring to Figure 4 on page 386. It is indeed one of the rare diagrams that shows 68 different Renaissance plays, instead of only a handful. And yet it only shows 28 "Shakespeare" plays out of at least 37 plays currently attributed to "Shakespeare"; there can be a selection bias towards Percy's plays (if it mostly includes tragedies), while Jonson's plays can be in a statistical minority. This would explain why most of the "Shakespeare" plays cluster near non-"Shakespeare" plays, but some are further away (as the others might be Jonson's otherwise bylined works). If all of "Shakespeare's" plays are not tested, the combination of tested plays can lean towards Percy or Jonson and thus show a difference between "Shakeseapre" and not-"Shakespeare", when it is really only a difference between mostly Percy and mostly Jonson ghostwritten play between the slightly varying clusters. For example, the two specifically named plays in this diagram "Titus" and "3 Henry VI" were both ghostwritten by Percy, and they are very close to each other on the diagram. And with most of the plays from these decades ghostwritten by either Percy or Jonson, a diagram of most plays shows only the slight divergences between these two ghostwriters' styles, and they are intertwined as they frequently co-wrote plays. If there had been many different authorial signatures in this diagram, there would have been a random separation of styles across different quadrants of the diagram, but they are all clustered near each other in a blob, just as they would be from primarily two dominant co-writing hands.

Just as I assumed the 28 "Shakespeare" plays are likely to be light on Jonson as Burrows comments: "the Shakespeare plays that emerge as outliers are the subgroup of four located above and to the right of his main group—Coriolanus, The Merry Wives of Windsor, and the two earliest comedies." Both "Coriolanus" and "Merry" were ghostwritten by Jonson, so since there are fewer Jonson "Shakespeare" plays that were tested they have registered as "outliers" from the mostly Percy-signature "Shakespeare" plays.

Burrows also specifies by-name: "Titus Andronicus and 3 Henry VI lie between the Shakespeare and non-Shakespeare sets. Their nearest neighbors to the left are Romeo and Juliet and Henry V. To the right are Peele’s Edward I, Greene’s James IV, and Marlowe’s Edward II." Here I am starting to suspect that the numbers were manipulated, and not merely that the corpus of tested plays was manipulated. Percy ghostwrote "Titus", "3 Henry VI", "Romeo", "Henry V", and "James IV". But Sylvester ghostwrote the "Peele"-bylined "Edward I", and he is likely to have also ghostwritten the "Marlowe"-assigned "Edward II" (though this is the one text out of this group that I did not test). This critic appears to have assumed all of the monarch-tragedies were by the same hand and to have clustered them together without testing if "Edward I" was indeed unlike the others. You can see the significantly different measures "Edward I" has in my 27-tests vs. the measures the Percy or Jonson-ghostwritten plays have. How Burrows and those he is citing arrived at the diagram visualizations and summaries of their conclusions cannot be determined because none of these researchers share their raw data or "analysis", so it can be entirely non-existent, and they can just be drawing pretty pictures that match what they assume or know from current bylines the attributions should be.

In summary, Burrows' "A Second Opinion" does not present anything different from the articles by him and others I have reviewed before. He does not even compare all 37 "Shakespeare" plays, and he does not list the 27 plays that are compared to allow auditors to check if there has been selection-bias in how they were chosen (such as a preference for tragedies over comedies).

Both "A Second Opinion" and "Shakespeare, Computers" lean towards an approach of comparing the occurrence only of "hapax legomena" or words that occur only once in a text, or the most rare words in any given text against their appearance or non-appearance in other texts. This is a chaotic approach to testing that appears deliberately designed to avoid rational attribution findings, as many of these words are likely to appear only a single text, while the rest will appear in nearly every text even if they are constantly only used once-per-text. This is what page 20 in "Shakespeare, Computers" covers. They add that the checked for the occurrence of these words in all "Shakespeare" texts against all of the non-"Shakespeare" texts in their corpus. This could only lead to an absurd conclusion as most of the plays during these decades were ghostwritten by Percy and Jonson, so they ghostwrote most of the "Shakespeare" and non-"Shakespeare" plays, with some help from Sylvester, Harvey and Verstegan with a small number of non-"Shakespeare" plays.

Then, "Shakespeare, Computers" absurdly finds that the word "gentle" appears a lot more in "Shakespeare" than outside of this byline. I searched for the word "gentle" inside of my 284 texts, and 278 of them had the word "gentle" appearing in them at least once. So, it is absolutely impossible that it only appears in "69 percent" of the "Shakespeare" segments and "55 percent" of the "other segments" as these authors claim. Their cut-off word is "heaven" appears in 143 texts, but this is only because in the other 167 texts it is spelled as "heauen" (there is some overlap between these where in a single text both the "heaven" and "heauen" spellings are used); so basically, "heaven" appears in every text in the corpus at least once. This proves why these common but not the most-common words cannot be used to establish attribution, as their mere presence is not indicative of an authorial style, as they are simply common words all authors in this period used at some point.

They also claim that the word "yes" is "rare" in "Shakespeare", but common in their otherwise bylined segments. "Yes" appears in 211 of my 284 texts. These include "Shakespeare" and non-"Shakespeare" texts without a distinction. For example, in "3 Henry IV": "Yes, I agree, and thanke you for your Motion." And in "2 Henry IV": "Yes my good Lord, Ile follow presently." And despite this, these guys are using "yes" as a defining non-"Shakespeare" feature.

And selection-bias enters in as a problem because these researchers are selecting sections out of these texts of 2,000 words or less, and they are not specifying which sections they are choosing, so they can say they chose a section without these words if questioned how a word like "heaven" that is in every text happened not to be in some of the fragments they tested for attribution.

In Elliott and Greatley-Hirsch's "Arden of Faversham", page 159 is comparing "Kid" to non-"Kid" plays. In EEBO, the "Kid" byline only appears on a single text out of the two texts it has assigned to "Thomas Kyd", a non-fiction pamphlet called, "The trueth of the most wicked and secret murthering of Iohn Brewen, goldsmith of London": https://quod.lib.umich.edu/e/eebo2/A72503.0001.001?rgn=main;view=fulltext - the byline states that it was printed for "John Kid" (not "Thomas Kyd"). "Spanish Tragedy" was anonymous when it was published, and was only later assigned to "Kyd" because a "kid" was said to have written "Spanish" in a contemporary critical note. "ARDEN OF FEVERSHAM" is considered as anonymous even in EMED: https://emed.folger.edu/arden . Though there is one play that does have "Thomas Kid's" byline on it as the translator, "Cornelia": https://quod.lib.umich.edu/e/eebo2/A01501.0001.001?rgn=main;view=fulltext . Since "Cornelia" is a translation in an early cluster, it is likely to have been ghost-translated by Sylvester. Whereas the anonymous original "Spanish" tragedy, according to my 27-tests was ghostwritten by Percy. And the anonymous "Arden" was ghostwritten by Jonson. So, checking, as Elliott and Greatley-Hirsch do, "Cornelia" and "Spanish" as 2 "Kyd" plays and comparing them to "Arden" as a mystery text, and against non-"Kyd" plays (such as "Marlowe's" "Jew" and "Paris", which were both ghostwritten by Percy) is an absurd exercise that can only present nonsensical results if the data had been fully disclosed. 49 of my 284 texts had "murdered" in them (including Percy's "Kid"-assigned "Spanish" as well as his other plays such as the "Shakespeare"-bylined "2 Henry IV"), while another 32 texts spelled this word instead or additionally as "murthered" (Verstegan, Harvey, and Sylvester prefer this spelling, though they also occasionally use "murdered"; Sylvester uses it in his "W.S."-bylined "Locrine"). Thus, searching for only "murdered" and not the alternative "murthered" spelling prefers the Percy-ghostwritten plays over those created by these other ghostwriters. They conclude that "Arden" falls into the not-"Kyd" cluster or that it is unlike "Cornelia" and "Spanish"; since my tests classify "Arden" as Jonson's, "Spanish" as Percy's, and I can guess that "Cornelia" is Sylvester's; I would agree that "Arden" is not created by the same linguistic signature as either of the two "known" "Kid" plays, but neither are these two "Kid" plays written by "Kid", but they are rather the work of the before-named ghostwriters.

If you were actually capable of launching a defense for the method(s) these authors are pitching, you would address every one of my criticisms point-by-point. But instead you tend to toss insults at me, with generalized outcries of discontent, while you cite new authorities that make the same types of problems in their attributions. These preceding attempts at attribution have not worked because their method(s) are faulty. My method is the correct, rational, and fully data-disclosing method that has derived the accurate attributions for 284 Renaissance texts.

443faktorovich
Editado: mayo 22, 2022, 2:59 pm

>436 Petroglyph: "Z-score is... Take the relative frequency of a word in one text (say, the)," (944 occurrences in Percy's "Shakespeare"-bylined "2 Henry VI" that has 26,134 words = 944/26134? = 3.6% of all words) "subtract the mean (the average of the frequency of the across all texts in your corpus)," (This is likely to be a pretty similar figure as "the" occurs at around the same rate in all texts. For example, the anonymous "Richard III" has 746 occurrences out of 20,525 words = again 3.6%. So subtracting it from the other measure gives you 0) "and divide the result of that by the standard deviation (also calculated over the entire corpus)." (The "standard deviation is calculated by adding up all of the frequencies across the corpus - which can be done with this calculator - https://www.calculator.net/standard-deviation-calculator.html . If all of the "the" percentages are 3.6%, then the "standard deviation" between them is 0. So you are saying take 0 from the previous measure, and divide it by 0 to come up with... 0) "that is the Z-score of the in that text." (Fantastic, the "Z-score" is 0 for both of the texts I have tested so far. Does this mean they share an author, or not? It might mean that Percy has a very similar percentage of the word "the" across his texts, or it might mean that most authors have approximately 3.6% of the word "the" in their texts. Your Z-score cannot possible be practically useful for attribution, and it would only become more chaotically nonsensical the more words and texts are compared with this obviously useless measure.)

When the computer is processing these calculations, you might not realize that it is comparing near-identical scores, or that some words are so rare they only appear in a few words on a given topic. If you stopped to analyze the steps in this process, you should have noticed this method is unsuitable for authorial attribution.

When I mark 18% of texts as similar on a single test, this is not a conclusion about its authorship, but rather only 1 out of 27 tests. For texts to be classified as similar in authorial style, they have to match on at least 10 of these tests, or they have to be within 18% of each other on 10 different measurements.

444faktorovich
mayo 22, 2022, 3:09 pm

>440 andyl: Without carbon-dating manuscripts claimed to have been written in Britain prior to 962, it is impossible to settle the question with certainty one way or the other. I present other evidence across the book to support my stated conclusions. "Old German" was used in the territory of which Otto I was King and then Emperor; and the name or contents of this language is relevant to my study of the use of Old English/Old German in early British texts, and not how the people of Germany called themselves in 962. I did not make any claims about "Common Brythonic", or "Old Norse". The DNA analysis I cited earlier found that modern British people are all descended from Germans and Dutch dating back to 962 or so, without any other major groups in this mix (such as "Vikings" or "Old Norse"). So, your objections regarding other language variants should be put to the authors of that study. Feel free to email them with your deep comment: "This is just nuts."

445Petroglyph
mayo 22, 2022, 3:35 pm

>443 faktorovich:

If all of the "the" percentages are 3.6%, then the "standard deviation" between them is 0. So you are saying take 0 from the previous measure, and divide it by 0 to come up with... 0)

So. You assume that the frequency for one word is exactly the same for all texts in your corpus (3.6%). You assume that the standard deviation is zero.

I guarantee you that the frequency of the across your corpus is not a round 0.036. You really should put more effort in double-checking things.

divide it by 0 to come up with... 0

You divide by zero, and the result you think you get is zero.

I don't even have to ask you to sit down and do the actual calculations for a couple of texts: you'll just barely look at your own data and find an excuse -- any excuse at all, including dividing by zero -- to not do any checking. Absolutely useless, you are.

Fantastic, the "Z-score" is 0 for both of the texts I have tested so far. Does this mean they share an author, or not?

You look at two texts. Assume nothing is different for the others. And on the basis of gob-smackingly stupid maths for a single word in two texts (dividing by zero = 0 !!!) you give up and declare the whole process not practically useful for attribution?

That's low-effort, Faktorovich. Even for you.

446Petroglyph
mayo 22, 2022, 5:37 pm

>442 faktorovich:

Lies and incompetence. You're not worthy of my efforts.

You are not testing if "known" texts "by one author" "were actually composed by that person": you are just assuming this to be a fact

I am not. In plotting all of a purported author's texts on a graph in contrast with other likely candidates for the same texts, I am, in fact, testing all of those texts. If they cluster together, it's not because they have been declared to be by the same author, but because the results for the relevant tests cluster them together. The clustering together is the outcome of the test, not the pre-ordained assumed conclusion.

You are not testing if "known" texts "by one author" "were actually composed by that person": you are just assuming this to be a fact. This is the problem with your method, which you are working to avoid acknowledging.) (emphasis added)

It is not. This claim that "fakespearean 'scholars' are just assuming that Shakespeare was real and that's why they're wrong" is a cornerstone of every single anti-stratfordian garbage-seller. It is something every single one of you claims or at least implies. It is an article of faith amongst you anti-stratfordian conspiracy theorists, and no amount of showing you the tests is ever going to convice you. That is what makes you a conspiracy nut, Faktorovich.

That, and your habitual lying and unwarranted rewriting of broad swathes of history just to fit with your little pet theory.

You're no different from Emmerich, or Looney, the baconians or any of the group theorists.

No, your tests depend on some texts being of known authorship, or there is nothing to compare a mystery text against to check if it is similar or different.

I, and, more to the point, Burrows and Craig and all the real professionals, test a mystery text against works by authors who are claimed to have written the mystery text. Those tests are clustering tests -- designed to detect relevant differences across texts and group together likely candidates. If there is any multi-authorship in the subcorpus that represents one purported author's work, then the clustering software will pick up on that. The clustering software compares any pairs of texts. All possible pairings in the corpus. Every single one.

Let's say we have a bunch of texts by Shakespeare, Marlowe, Peele, Nashe and Jonson. And one text that could be written by one of these five. The software isn't just testing the mystery text. It's testing every single text in the corpus at the same time. The same tests that check the degree of similarity/dissimilarity between the mystery text and all the other texts in the corpus, checks the similarity between any two texts in the corpus. If any of Shakespeare's texts in the corpus were by Jonson, the very same tests that try and assign the mystery text would assign them to Jonson.

Your entire "discussion" of Burrows boils down to "Faktorovich's garbage method shows different results" and assumptions that Burrows fudged the data. You do not understand that his graphs are the result of all possible pairings.

Your approach is intellectually bankrupt. Empty verbiage and empty accusations is all you've got.

Both "A Second Opinion" and "Shakespeare, Computers" lean towards an approach of comparing the occurrence only of "hapax legomena"

That is a straight-up lie. You're a liar, Faktorovich.

From Craig and Kinney's book, p. 91:

Its base data are counts of the lexical words that appear regularly in a group of texts (here, Shakespeare’s plays, divided into segments) but only rarely in a second group (segments from plays by other writers dated by the Annals of English Drama between 1580 and 1619), and counts of the complementary list of words rare in Shakespeare but common in the other group.

That is mentioned passim, in that book, by the way. They don't use hapax legomena. they use the words that are rare in one group but comparatively frequent in the other. The more typical words for either group.

You've either not read this book, or you have and you're just making it up.

"Gentle" is more common in Shakespearian words; "yes" in Shakespeare often means an affirmative yes, as in, extra emphasis, or as in response to a negative (compare si in French, or jo in Swedish). In other playwrights "yes" is more commonly uses as just "yes"; Shakespeare would more typically say "yea" in those cases. Comparatively, that is. Not in every single instance. But more often than texts written by Marlowe and Jonson and all the others.

This is explained at length in the book. It is impossible for you to have actually read this book and not know this.

Your "discussion" of this method rests on a fundamental lie, and can, ipso facto, be rejected. A lie of having read the book, or the lie of misrepresentation. I don't care which one. You cannot be trusted either way.

You're an incompetent who divides by zero, fails to grasp the notion of a Z-score, yet who thinks they can criticize statistical approaches to natural language processing and data science.

447Keeline
mayo 22, 2022, 7:39 pm

>443 faktorovich:

This is likely to be a pretty similar figure as "the" occurs at around the same rate in all texts. For example, the anonymous "Richard III" has 746 occurrences out of 20,525 words = again 3.6%.

Just because you can count something like the exceptionally common word "the" and compare it with the number of words in a text does not mean it is significant to author attribution. Here is a case in point. These are counts of the word "the" (your sample word) in a corpus I have handy with 39 books. The spreadsheet shows the number of occurrences, the words per text, and the percentages. The usage of "the" floats over a range of 5.24% to 7.73%.

The variation of this does not correspond with the authorship based on contracts and correspondence.

Counting the common word "the" is not relevant to an authorship attribution. It is in every single one of the texts between 782 and 3335 times. If we exclude the two short books, the range is still only 2111 to 3335.

We could pull out, as you do, the six most common words that are not proper names (like "Tom") from the corpus as well. They are, for the record (with counts of the appearances in the 1,608,052 word corpus with 21,111 distinct words):

100394 the
44923 to
42243 and
37077 of
35877 a
32942 i
26049 he
25939 tom
25025 it
23602 in

I left "Tom" in there so one can see where it ranks amongst the most common words.

By the way, since my texts came from a couple sources, I found some anomalies. Some Gutenberg editors transform an em-dash to two hyphens (--) and others make it an em-dash character (—). There were some cases where italics from the original were marked off with underscore characters (_) and these had to be corrected to get proper counts.

Likewise, when grabbing counts of "the" I had to be sure that it was not finding all words that start with "the" or have "the" in the middle.

100394 the
 15 theater
 7 theaters
 2 theatre
 3 theatres
 18 theatrical
 2 thee
 30 theft
 2 thefts
3013 their
 13 theirs
 1 theirselves
4279 them
 255 themselves
4413 then
 12 thence
 1 theodore
 2 theoretical
 2 theoretically
 11 theories
 94 theory
 2 ther
7536 there
 6 thereafter
 13 thereby
 3 therefor
 30 therefore
 2 thereof
 1 thereto
 28 thereupon
 11 thermometer
 1 thermometers
 4 thermos
 3 thermostat
 923 these
 6 thet
10827 they

On this list you can see that "the" appears 100,394 times in the 39 texts. Next in line of these are "they" (10,827) and "there" (7,536).

None of this is particularly surprising considering the structure of English used in a narrative text.

Potentially more interesting would be to see the contexts of (theater / theatre) and (theaters / theatres). Are the "re" endings specific to a dramatic setting or is it a word choice?

What about "therefor" and "therefore"?

Words like "thence" and "thereby" are a bit formal for a juvenile adventure story but the writers did not write down to their audience.

Is the distinction between (clew / clue) and (clews / clues) an editorial difference or an author's word choice? And if neither, did a PG transcriber normalized the spelling?

% egrep "clue|clew" allu
 66 clew
 23 clews
 64 clue
 28 clues

Both words were considered synonymous when the series was published. But one publisher, Street & Smith, had a publication with "Clue" in the title and they treated it like a trademark, often causing other publishers to use "Clew". This would not have affected Stratemeyer but the time when he grew up and the stories he read in his youth and into adulthood could influence his spelling preference when editing a story.

There are many things that can cause one to look more carefully at a text and find the context of words in it. Many won't have special relevance, some might. Sometimes you end up with more questions than certain answers.

For example, I know that Edward Stratemeyer preferred to hyphenate "to-day" in his letters. He may have carried this through to his edited texts. Whether they survived the decisions of the publisher editor and the typesetter (each a different company with style rules that were not always consistent).

In the published editions, "to-day" appears 94 times in 26 of the first 33 books published during Edward Stratemeyer's lifetime. None of the stories after he died in 1930 used "to-day." Is this a result? Should I write up an article for one of the publications bout this find? No. Wait.

Meanwhile, "today" appears 59 times in 19 books. These include books from the first 33 and ones afterward:

TS01 TS02 TS03 TS08 TS09 TS10 TS16 TS20 TS21 TS22 TS23 TS24 TS25 TS34 TS35 TS36 TS38 TS39 TS40

But, to say anything meaningful, I'd have to look at those uses of "today" in the published texts and ensure that some Gutenberg editor did not change the spelling. I have PDFs of the stories.

Now I see "to-day" in 32 of the first 38 stories, including ones after Edward died. Then "today" is in the same list of stories. The Apple Finder search conflates the two variants so it is not suitable for this question. A check with Devon EasyFind was not helpful. It might be necessary to extract the text layer from the PDFs with page images (likely full of OCR errors) to see what can be found. Better yet, I can open the PDFs in Apple Preview or Adobe Acrobat Pro and see what their search features uncover. It is more tedious but I can see the page image and know if what comes up in the search is printed that way on the page.

For example TS01 has several appearances of "to-day" even though the command-line counts above did not reveal the hyphenated variant in my searches. It is important to be careful and watch for edge cases that might provide misleading results.

I manually opened the 32 PDFs I have for the series. Most have page images. A few are only available in retyped versions from Gutenberg or similar sources. Five books, all from the latter sources, had "today". All of the page image books, including one Gutenberg, had "to-day." I have the physical books (one of the better collections in the world, by the way) but have not pulled them off the shelf for this minor exploration. Knowing where the Gutenberg versions have "today" I could check the five books and spot the pages fairly quickly once I note the chapters and nearby words at the beginnings of paragraphs.

I had to process the texts a few times to make sure that "the" near a hyphen or dash did not create the appearance of compound words where they are not there in the printed edition. These are grouped at the bottom of a list by word count so are easier to spot.

This shows that the tools used make a big difference in getting useful answers, if they are useful at all, to questions and counts.

James

448faktorovich
mayo 22, 2022, 8:14 pm

>445 Petroglyph: The "corpus" that I tested for this exercise is composed of 2 texts, and both of them had exactly 3.6% of "the"; so for them the "standard deviation" is exactly 0. I did not say that all of the texts in my corpus of 284 texts had exactly 3.6% of the word "the", but it is possible that all do. You have to consider the possibility that some words will have identical "standard deviations" for all texts, while others will not occur in one text, but would be present in many others, and numerous other combinations that can produce too many wildly different "standard deviations" when all words in a text or over 100 words in 284 texts are all compared on their "standard deviations".

Why don't you try a harder effort by testing 10 Renaissance plays on a word of your choosing, and then we can decide if the results make any sense, and if this method can possibly be used for attribution. I do not need to replicate these 10-text experiment because I already proved by choosing 2 random texts and a word of your choosing ("the") why it does not work. You could have proposed a different word if you had anticipated that "the" is uniquely unsuitable for this test (which you should have known if you had tested "the" in particular before with this test).

449faktorovich
mayo 22, 2022, 8:38 pm

>446 Petroglyph: I am saying that if no data is provided, it is more likely that a researcher just drew a diagram that reflects the texts current attributions (or "assumed" attributions), without actually checking where they would fall in relationship to one-another.

The idea that the "clustering software" compares all pairs of texts for shared authorship is just science fiction that is not reflected in the actual attribution method you described that requires a known-author text or group of texts be compared against a specific unknown or mystery text. Comparing all 284 texts against each other is something I have never seen in the data provided in such studies that at most compare texts against a group of texts in a given byline, such as "Shakespeare". It is statistically impossible to compare 284 texts on all of their word frequencies/ z-scores and to make sense of the chaotic sets of z-scores that range from absurd 0s to all sorts of other random outputs on the different combinations of word frequencies etc. A single text will have thousands of different z-score results for each word that will be all over the map; you still have not explained how these results are combined to arrive at an attribution for the text as a whole. Or are you only testing individual words like "murtherer"?

Everybody can see from my data that Jonson's plays share similar top-6 words, and other measurements with most of the "Shakespeare"-bylined and Jonson-ghostwritten plays. Readers can check my data and then look at these books that claim to compare these same plays without finding similarity between them. They can see my raw data and analysis or how I arrived at my conclusions, but they cannot see the data in these other studies. The logical conclusion they should reach is that my data and attributions are accurate, and these other studies have an erroneous method or have falsified their findings to arrive at a conclusion that does not match the true attribution.

"A Second Opinion" directly states that it is basing its findings on "hapax legomena", so I am not lying. As for the second critical piece, "lexical words that appear regularly in a group of texts... but only rarely in a second group", is a deliberately vague terminology that allows the critic to choose words that occur at almost any frequency between once-per-text to being among-the-most-frequent-per-text words. This is how they test both a common word that appears in every text like "gentle" and a semi-rare word that only appears in some texts such as "murderer". If they had defined precisely how many occurrences or rate of frequency is the cut-off-point, it would have been possible to more precisely prove that they are manipulating their data when they choose these obscure words in unspecified random chunks of 2000 words out of the text.

You just make statements that contradict the reality that I just proved with a test with the confidence that your lies will not be confronted. For example, I just proved the opposite of this statement: "'Gentle' is more common in Shakespearian words"; it is equally common across all texts in the Renaissance. Similarly, there is no relevant difference between "yes" and "yea"; I have already proven it is common to all texts alike.

450faktorovich
mayo 22, 2022, 8:56 pm

>447 Keeline: You have to provide a list of the texts in your corpus for readers to be able to check if your claims are true or false. Without a list of the texts you have tested, you are just providing fictitious data that appears to match your argument.

The frequent appearance of "the" is not a reason to pull it out. It is very likely that the rate of the use of "the" was identical in my two random texts because both of them were in Percy's hand; thus, the rate of "the" would have obviously identified Percy's hand if only the rate of "the" was tested and other authors in this corpus other frequencies of their use of "the". It is ridiculous that the z-score ended up being 0/0 = undefined. But this aside, testing the frequency of the most common words in a text is revealing. The problem comes in when a researcher attempts the impossible task of comparing all words in a text, or all except the 6 most common or the like, as there are too many chaotic outcomes and too many comparisons, without a defined method for sorting through them. In contrast, my test for the patterns of which words are in the top-6 most-common words provide simple match or no-match results; if the top-6 words, such as, I, and, the, to, a, my (pattern-j in the "Shakespeare" corpus) match between 2 texts, then these 2 texts have a match on this test; if they have different patterns, then they do not match on this test. I can explain the basis for a match simply in a few sentences. In contrast, no specific cut-off or basis for a match vs. a non-match across all words in all texts can ever be provided.

And when I test for the top-6 words, all spelling variants are given equal measure, and there is no chance I am missing "murtherer" because I have only searched for the "murderer" spelling, in case one of them is more frequent than the other. Meanwhile, if you are testing all words, and you find a pattern for any given word like "murderer", you in fact could have erred by comparing a modernized text to an original-spelling text, or two ghostwriters, one of whom preferred only one of the spellings for the same word. My tests would find the most-common words if a text is in old-spelling or modernized with the same precision - they top words would just be spelled differently.

451Keeline
mayo 22, 2022, 9:47 pm

>449 faktorovich:

Everybody can see from my data that Jonson's plays share similar top-6 words, and other measurements with most of the "Shakespeare"-bylined and Jonson-ghostwritten plays.

The real question is which texts don't use the same top 6 words?

Using an example from my corpus, I have listed the top 7 words from each text with the number of instances from each text. To help with seeing patterns, I have added some colors. For "Tom" I left the cell white but made the text red.

"The" is the most common word in every text, by far.

"To" is the second most common word in 24 of 39 texts. "To" has the third highest spot in 11 of 39 texts.

Once again, the top 6 words are not an indication of authorship in these texts.

Utilities and software can let you count anything and make almost any comparison. But not all of them are relevant.

James

452Matke
mayo 22, 2022, 10:15 pm

>448 faktorovich:
Would you please supply the title of the two texts in which you determined that the word “the” makes up exactly 3.6% of the words? That would prove your point, and the validity of your method in testing for it, beyond any doubt, as the works could then be checked and your analysis verified by research other than your own.

It is, statistically speaking, not possible for your corpus of 286 works to each have the word “the” at a 3.6% frequency. Or rather, that possibility is so very remote as to be a ludicrous idea.

Your thought processes are interesting.

453Keeline
mayo 22, 2022, 10:53 pm

>450 faktorovich:

You have to provide a list of the texts in your corpus for readers to be able to check if your claims are true or false. Without a list of the texts you have tested, you are just providing fictitious data that appears to match your argument.

No reply without an insult. Accusing me and others of "fictitious data," etc. is not having a discussion in good faith. You constantly take a fleeting glance at a reply or an article and make wholly untrue statements and leap to snap judgments. It's no wonder you seem to have looked at the English Renaissance for a couple months before reaching your unshakable conclusions and setting forth to rewrite history. Despite your uncivilized behavior in this whole thread, I will go a step further.

I have told you that these books are 39 of the 40 volumes in the Tom Swift series (1910-1941). But since your next question will be for me to supply the texts,

https://Keeline.com/pics/TS.zip (3.7 MB)

I can't supply what I have for TS 30-40 because those are still under copyright (published between 1927 and 1941). After your earlier comments in this regard, I'm sure you will agree. But the counts are there in ts_uniq.

James

454prosfilaes
mayo 22, 2022, 11:19 pm

>443 faktorovich: "Z-score is... Take the relative frequency of a word in one text (say, the)," (944 occurrences in Percy's "Shakespeare"-bylined "2 Henry VI" that has 26,134 words = 944/26134? = 3.6% of all words) "subtract the mean (the average of the frequency of the across all texts in your corpus)," (This is likely to be a pretty similar figure as "the" occurs at around the same rate in all texts. For example, the anonymous "Richard III" has 746 occurrences out of 20,525 words = again 3.6%. So subtracting it from the other measure gives you 0) "and divide the result of that by the standard deviation (also calculated over the entire corpus)."

944/26134 = .0361215275
746/20525 = .0363459196
so the difference between the two is not 0, it's 0.022%. When you're going to divide by zero, you never want to round off to zero if possible.

How can you tell anything from two works? You claimed that everything should be visualized; two points are just two points. You have a distance, but what does that distance mean? Add a few more, then you can get meaningful grouping, and close and far become obvious.

455bnielsen
mayo 23, 2022, 12:59 am

>447 Keeline: Nice work! I do something similar to find typos in my own reviews (which sometimes include a table of content, so that gets tested too). It's mostly looking for words that don't occur in any other review, but also for invisible characters and quote characters other than ' and ". A useful tool is to report any characters not in a list of allowed characters.

456Keeline
Editado: mayo 23, 2022, 10:16 am

>455 bnielsen:

On the Unix-like systems (I use a Mac at home and Linux at work) there are some interesting command-line tools that can do just about anything with text file manipulation once you learn how. Often there are two or three ways to do things. For example, I tightened up the commands to get a list of each distinct word from a work and the count. For me, at least, it is easier to read than the former method:

for FILE in *; do egrep -o '\w+' $FILE | tr ' \:upper:\' '\n\:lower:\' | sort | uniq -c > ../ts_uniq/${FILE}u; done

With newlines and indents it looks like:

for FILE in *; 
do 
 egrep -o '\w+' $FILE | 
 tr ' \:upper:\' '\n\:lower:\' | 
 sort | 
 uniq -c > ../ts_uniq/${FILE}u; 
done

1. Loop through the files in the current directory (can be more selective than the * wildcard).
2. Search the file name in the loop for words (\w) and display only the matching parts (-o).
3. Translate upper case characters to lower case characters.
4. Sort alphabetically.
5. Show just one copy of the word with the count (-c)
6. Send the output to a file up one directory (..) and into ts_uniq with a filename of the original story file with a "u" at the end.
7. Continue looping until the files are all processed.

There are other steps to convert em-dashes to hyphens but I did them separately on the story files.

You can strip away non-printable characters with patterns like

'\a-zA-Z0-9"\'-\'

This is hard to show on LT talk because of its propensity to use the square brackets for titles and authors and not preserve spaces.

James

457bnielsen
mayo 23, 2022, 2:03 am

> 456. I have multilingual texts and lower and upper doesn't catch all of the words but just glancing at the output found me a typo, I think:

1 Яковлевиц
1 Яковлевич

but yes, it is similar to what I already do. (Typo fixed. Thanks!)

(It was a English translation of Наум Яковлевич Виленкин: Рассказы о множествах, in case anyone wonders).

458Matke
mayo 23, 2022, 5:50 am

>448 faktorovich:
>452 Matke:
>454 prosfilaes:
Ah. I see that the texts were named, so forgive my mistake, please.

I also see that, per >454 prosfilaes:, you’ve made the mistake of rounding to the nearest hundredth. This gives a false result when you perform further computations.

459faktorovich
mayo 23, 2022, 1:25 pm

>451 Keeline: Since you are not listing the bylines of any of the texts in this particular corpus, you are not at all allowing users to check if your diagram of the top-7 words does or does not frequently separate into groups of similar authorial-signatures. As I explain in BRRAM's Volumes 1-2, the top-6 words patterns are strong indicators of signatures. For example, the most-common a-pattern of---I, to, and, the, a, you---only appears in Percy and Jonson-ghostwritten texts, and many of these were likely to have been co-ghostwritten by both of them, and one of them had an unconscious preference for this pattern. Only Percy uses the d-pattern (the, and, i, to, you, my) - in all of the 7 texts that use it out of my 284-text corpus. These 7 include: Anonymous - Nobody and Somebody, "Chapman" - Gentleman Usher, "William Rowley" - a Woman Never Vexed, "Chapman" - Blind Beggar of Alexandria, "Shakespeare" - Yorkshire Tragedy, "Shakespeare" - Thomas Lord Cromwell, and "Heywood" - Woman Killed with Kindness. These texts match each other on numerous other texts that support the top-6-word test's conclusion. Have you attempted testing this corpus of 7 texts to see if the "Shakespeare" or "Chapman" texts are any more like only these bylines, or if they are equally alike to all of the rest of the texts in this group? Do you have an explanation as to why all 7 of them use this very unique top-words pattern?

460faktorovich
mayo 23, 2022, 1:31 pm

>452 Matke: Look back in this thread and you will find that I did give the titles and spelled out the math for the two titles. Here it is again: 944 occurrences in Percy's "Shakespeare"-bylined "2 Henry VI" that has 26,134 words = 944/26134? = 3.6% of all words... the anonymous "Richard III" has 746 occurrences out of 20,525 words = again 3.6%.

All of the tests as I have described them can be checked by anybody other than myself. I have given full bibliographic entries and all the data, and method-steps needed for my method to be duplicated and audited by all interested. I did not count how many texts in my corpus have precisely 3.6% of the word "the", as it is not part of my method. I just performed this test to check if the rival formula proposed would produce a glitch, and it did. The corpus of 2 texts I tested for this mini-experiment both had 3.6% of "the"; a corpus is however many texts are tested in a given experiment, not in all experiments previously performed.

461Matke
mayo 23, 2022, 1:40 pm

>460 faktorovich:

Apparently you overlooked post >458 Matke:. And you’ve not addressed the problem of rounding off your numbers, which is the path to perdition in statistics.

462faktorovich
mayo 23, 2022, 1:42 pm

>453 Keeline: That is an un-openable set of files that are labeled by the broken term ".file", instead of .exe or .doc. So you have sent a file that is corrupted or both corrupted and deliberately designed not to open. Thus, you are doing everything imaginable to stop reviewers from checking or auditing your work. And as you have pointed out the "Tom Swift" series was written by many ghostwriters, so it is to-be-expected that it would have different top-6-words combinations to reflect the different ghostwriters, or collaborations of two or more ghostwriters that worked on them. There are a few matches in this group such as the one between TS15 and TS20, so these two are likely to be by the same hand. You can begin to build a signature grouping just from matching these 6-word patterns, but you need to apply the other 27+ tests to check which precise texts group with others, and which are unlike each other.

463faktorovich
mayo 23, 2022, 1:55 pm

>454 prosfilaes: No, you would have to do not only "a few more". The central problem with this method is that you would have to do millions of these calculations for a corpus of 284 texts with over 7 million words in it. After you or your computer is done creating a set of data points on an impossible 284+ different axis, you are going to have no stated method for determining what weight to give a strong similarity on the word "the", but a weak similarity on the word "murtherer", etc. Try figuring out these z-scores for 100 different words in 10 different texts, and then tell me how these spread all over figures can be combined to reach an attribution conclusion. Only looking at a few words such as "the" in isolation is open to glitches and is too simplistic for a method that is supposed to be relying on complex computing, but is actually only considering a single data-point. And if you are going to make a distinction between 3.612% and 3.634%, when other texts you guys have mentioned, such as TS04 has a rate as high as 7.73% of "the"; then, you are finding divergences where if the curve included many other signatures its standard deviation could have placed these texts in the same spot on the curve in contrast with outliers percentage-points away.

My conclusions are based on testing 284 Renaissance texts on 27+ different tests. I did not make my own attribution conclusions on any shortfall of texts or tests. As you can see from the data available on my GitHub. To audit an avalanche of the falsehoods you guys are pitching, I have to select isolated examples to explain what is wrong with what you are pitching.

464andyl
mayo 23, 2022, 1:56 pm

>462 faktorovich: That is an un-openable set of files that are labeled by the broken term ".file", instead of .exe or .doc.
Because of course there are only 2 file types. It is quite clearly labelled as a zip file. I have just downloaded it and can open it.

465bnielsen
mayo 23, 2022, 2:23 pm

>462 faktorovich:; That is an un-openable set of files that are labeled by the broken term ".file", instead of .exe or .doc. So you have sent a file that is corrupted or both corrupted and deliberately designed not to open.

Come on. I've just downloaded the file, unzipped it and browsed through some of the texts. This is a 100% standard zipfile with 100% standard text files.

>462 faktorovich:: "Thus, you are doing everything imaginable to stop reviewers from checking or auditing your work. "

Bwahahaha!

466Keeline
mayo 23, 2022, 3:00 pm

>459 faktorovich:

I see that you are completely unacquainted with the Tom Swift series. These stories about a young inventor all have the byline "Victor Appleton." That is a Stratemeyer Syndicate pseudonym used for this series and some others.

TS01 means volume 1 of the series, etc.

Here is a list of the volumes in the series if it helps you:

Unless you are claiming that Ben Jonson is immortal and was still writing stories into the first half of the 20th Century, his pattern is not relevant to this authorship question.

I was mainly trying to illustrate the folly of looking at the words that virtually all authors use as a significant part of a authorship attribution fingerprint.

It is like the recent case of the shooting in the NYC subway where they found a U-Haul key and told people to be on the look out for a U-Haul with Arizona plates. Guess what?

ALL U-Haul vehicles have Arizona plates.

I will say that 35 of these texts are by one ghostwriter. Two were by one person. Two were by another person. One person wrote just one text.

But the top seven words do not align with the pattern of this authorship. It is not helpful in this case and I reserve some serious doubts that they are meaningful for the English Renaissance. Indeed they may have been more misleading than informative.

James

467Keeline
Editado: mayo 23, 2022, 6:18 pm

>463 faktorovich:

The central problem with this method is that you would have to do millions of these calculations for a corpus of 284 texts with over 7 million words in it.

If, as in your questionable example, the word "the" were to appear 252,000 times in the 7 million words of your corpus (3.6%) then why would you try to make the comparison 252,000 times? Wouldn't once be enough?

If so, any estimates on the work to be involved should be based on the count of distinct words in the corpus not the millions of repetitions of words.

For example, my Tom Swift corpus of 39 books may feature 1,608,052 words but of these there are only 21,111 distinct words.

I also want you to cease your baseless accusations that I am trying to contrive my results or make it hard for you or anyone else to read them. It is not my fault that you are using a Windows computer and are not able to handle a plain text file without an extension. I suppose next you will complain that the files have only a newline character instead of a carriage return followed by a newline.

But, really, how much hand-holding do you require? It seems a couple others have been able to open the zip file (a standard way of sharing multiple files and directories) and open the files inside.

I could have used another compression algorithm such as a .tar.gz file or a .rar file.

If your computer can't handle files without extensions, add .txt to them.

I would not send a .exe (i.e. Windows executable program) and if anyone else did, you should not run it lest it be a form of malware.

A .doc file is a terrible container for texts you want to analyze with software because Microsoft makes many changes to the content including altering quotation marks, apostrophes, dashes, and more. Line endings and other issues arise and this was the point of comments in the early weeks of this thread back in December and the message has not been received, apparently.

Now, if you would like changes made to the files that I have packaged, at your request / insistence, then the first thing to do would not be to insult me and accuse me of deliberately trying to make things hard for you. Instead, cite the things that would make things easier and ask nicely. But that doesn't seem to be in your scope of social interaction.

I will mention that I have a day job and will not take any more time from my lunch break today to help you.

James

468indeedox
mayo 23, 2022, 3:37 pm

"That is an un-openable set of files that are labeled by the broken term ".file", instead of .exe or .doc."

That hurts.
What a world it must be to live in where everything turns around oneself, but only to get them.

469Petroglyph
mayo 23, 2022, 3:51 pm

>462 faktorovich:
>464 andyl:
>465 bnielsen:
>467 Keeline:

I downloaded the file, opened it, and extracted the files without issue. And I'm on a windows computer.

470andyl
mayo 23, 2022, 6:11 pm

>467 Keeline:
>469 Petroglyph:

I mean zip as a format is only 33 years old. I guess there are only a few of us who remember what there was before. tar and compress (gzip postdates pkzip), shar and uuencode for usenet, and ARC on CPM and MSDOS. Not that any of those (or the more modern tar.gz or rar) would have helped faktorovich any

471Petroglyph
mayo 23, 2022, 6:19 pm

>470 andyl:

I linked her to a .7z file in >338 Petroglyph:. She made statements in >344 faktorovich: that would indicate she'd opened the archive and had accessed the files inside.

I find it hard not to see the outline of the red flag known as "strategic incompetence" or "weaponized incompetence".

472Keeline
Editado: mayo 23, 2022, 6:38 pm

>470 andyl:

the more modern tar.gz

I never thought of them as "modern" though there may be refinements over the decades. The Zip format is the one that seems to be common on Windows so I went with that. Support to open them is a core feature in Windows (and Mac). Another advantage, rather than supplying links to 150 files or so, is that the Zip format on text files makes them as small as 10% of the original size. In this case the two directories are 11 MB on the computer and the Zip file is 3.7 MB so the savings are large but not as great but enough to be worthwhile.

The .tar format goes back a long ways. It stands for "tape archive" and was used to place multiple files in a single binary file to record on a reel-to-reel magnetic tape like the old videos of mainframe computers in the 1960s and 1970s like you'd see on television.

Meanwhile, the .gz stands for gzip or GNU zip. It was introduced in the GNU version in 1992 (a scant 30 years ago) and is a workalike for the BSD Unix utility compress which goes back to the mid-1980s.

ASCII text files are nearly as old as computers, perhaps 50 years or more.

A handy utility on Unix-like systems is file which will show:

% file TS22us
TS22us: UTF-8 Unicode text

So it is not too much of a mystery.

Since line endings are an issue with some computers and programs, dos2unix will convert carriage return+newline to just newline. The companion is unix2dos which will replace newlines with carriage return+newline for the primitive programs that can't detect the line endings and act accordingly.

Sometimes Windows does not make it obvious how to open a (text) file without an extension. This is the reason pages like this exist:

https://www.thewindowsclub.com/how-to-open-file-with-no-extension-in-windows

But basically it comes down to opening Notepad, a text editor that comes with all versions of Windows, and using its File >> Open system.

I guess I was wrong to expect that someone with a Ph.D. who is using computers for authorship attribution studies would be able to handle very standard .zip and text files without extensions.

If politely requested, I can add the .txt extensions to the files and re-zip the directories. But today's work day was more vexing than most so my patience in re-helping someone who does not appreciate it at all was strained.

James

473faktorovich
mayo 23, 2022, 8:46 pm

>464 andyl: The folder opens, but all of the files do not open; unzipping the folder does not change the fact that none of the files in it cannot be opened.

474faktorovich
mayo 23, 2022, 9:01 pm

>466 Keeline: I did not compare Jonson's signature to Swift's.

If you did not have doubts about the authenticity of a single person with the "Victor Appleton" pseudonym writing all of these texts, why did you test them. If you did have doubts, testing them with the top-7-words pattern test clearly shows that they are not all written by a single author.

The top-6 word patterns are indeed those that appear in different authors over the ages, but they are not so common that you are likely to find many matches between Renaissance and these "Swift" texts. For example, there are no texts with "was" in their top-6 words among the 284 Renaissance texts. The one similar pattern I spotted between these "Swift" texts and the Renaissance is the very common g-pattern (and, i, the, to, a, of) that mostly Percy and Jonson use. No, the 6-word patterns are not like finger-prints or something that is only used by a single author, but they do tend to separate authorial signatures into clusters, which is what is necessary in a method that combines 27 different tests. Each of them does not need to identify the signature on their own; they just have to confirm if two texts are similar or dissimilar, and the combination of these 0/1 results shows if they are indeed written by a single hand. There are some patterns that are unique only to a specific author, while they are extremely rare for all other authors, and these can give away that author on their own without further testing.

475faktorovich
mayo 23, 2022, 9:10 pm

>467 Keeline: "The word 'the' were to appear 252,000 times in the 7 million words of your corpus (3.6%) then why would you try to make the comparison 252,000 times? Wouldn't once be enough?" No. You would have to check the z-score or the degree of divergence between the frequency of the use of the word "the" in each of the 284 texts against each of the other 283 texts, so you would end up with 80,372 text-to-text comparisons just for single single word "the". Then, you will know precisely how far away from each other all of these texts are on the frequency of their use of the word "the". And then, you would have to run similar tests on all of the other words in the corpus. And then, you would have millions of data-points without any step in this method to figure out how to decipher this data into attribution conclusions. As you said, the texts might be .002/.001% off on their use of "the", but since it is not going to be all zero, you have to check all of these scores to be consistent.

The z-score counts the number of repetitions of a word in comparison to the number of its repetitions in a different text. You have to use a different formula if you are going to be merely comparing if a word does or does not appear across all texts.

Your files were not .txt files; they were just .file (which is a nonsensical extension, designed not to open).

Microsoft does not alter punctuation marks. That's just nonsensical and untrue.

476faktorovich
mayo 23, 2022, 9:16 pm

Este mensaje ha sido denunciado por varios usuarios por lo que no se muestra públicamente. (mostrar)

>472 Keeline: Files that can be opened in Notepad, open in Notepad automatically. Your files do not open in Notepad or any other standard program. They are not in the .txt format, which would indeed open in both Notepad and Microsoft Word. You and anybody else saying they opened these files in any standard program accessible to the public is lying. It is easy to send a corrupted file, and then accuse the user who cannot open of incompetence; but such insults only say that you are both a liar and incompetent.

477paradoxosalpha
Editado: mayo 23, 2022, 9:42 pm

>476 faktorovich: Your files do not open in Notepad or any other standard program. They are not in the .txt format, which would indeed open in both Notepad and Microsoft Word.

Macs don't require filename extensions to identify format. Have you tried just renaming a file or two, changing .file to .txt?

>476 faktorovich: You and anybody else ... is lying. ... you are both a liar and incompetent.

Whew. Wherever you go, there you are, Dr. Anna.

478amanda4242
mayo 23, 2022, 10:08 pm

>453 Keeline: Just wanted to say I had no trouble opening the zip file on my Chromebook or my Windows 10 laptop. Windows did ask me which app I wanted to open the file with, but Notepad was on the list of suggestions; remembering how to do a screenshot with Windows was more difficult!

479Keeline
mayo 23, 2022, 10:14 pm

>475 faktorovich:

This is really pathetic.

My files don't have a .file extension. Your Microsoft Windows unhelpfully added that. My files have no extensions at all because they are not necessary.

Obviously you don't know much about the default settings for Microsoft Word. It absolutely will alter characters in its infinite wisdom with features like "Smart Quotes." Here is one of their own links:

https://support.microsoft.com/en-us/office/smart-quotes-in-word-702fc92e-b723-4e...

This is why you don't ever want to let Word or another Microsoft program touch your files.

https://askleo.com/why_do_i_get_odd_characters_instead_of_quotes_in_my_documents...

This is largely an issue with character encoding but since you are befuddled with filename extensions, I don't expect you to know about this.

Since you are paralyzed without files with a .txt extension and can't seem to figure out how to add them yourself, here is a new zip file just for you. Notice that several other respondents in the thread had no trouble at all opening the zip and the individual files.

https://Keeline.com/pics/TS-F.zip

Windows looks for clues of how to associate a file with a program by its extension. Depending on the configuration of Windows, it may or may not show it to you.

https://www.computerhope.com/issues/ch000572.htm

But there are other ways of handling this.

If your only ability to open a file is by double clicking and letting Microsoft Windows do what it wants, you are bound to be disappointed from time to time.

James

480prosfilaes
mayo 23, 2022, 10:21 pm

>463 faktorovich: The central problem with this method is that you would have to do millions of these calculations for a corpus of 284 texts with over 7 million words in it.

We just went over this; we're using computers that have the entire computing power of the world in 1970. Millions of calculations are nothing.

an impossible 284+ different axis,

You don't have to call it dimensions, but there's no problem handing 284 different axes. The math is trivial.

you are going to have no stated method for determining what weight

I offered you one; a dot product. You can weigh that however you want, like downplaying functional words like "the" and increasing the weight on medium frequency words.

And if you are going to make a distinction between 3.612% and 3.634%, when other texts you guys have mentioned, such as TS04 has a rate as high as 7.73% of "the"; then, you are finding divergences where if the curve included many other signatures its standard deviation could have placed these texts in the same spot on the curve in contrast with outliers percentage-points away.

If I understand that correctly, yes. As I said, comparing two texts doesn't mean much; you need to compare multiple texts.

481faktorovich
mayo 24, 2022, 10:39 am

>477 paradoxosalpha: Changing the filename does not change the file-type, nor does it change a corrupted file into a usable file.

482faktorovich
mayo 24, 2022, 10:45 am

>478 amanda4242: These screen shots show: 1. You have a .txt version of these same text files that you have opened here in Notepad. 2. You are showing a separate window that indicates the corrupted file is indeed in the "file" "type of file". 3. Even if the files you sent were not corrupted and usable, all you would have sent were the word-frequency counts that anybody can get from a basic public-software counting tool for the texts in question. You still have not sent the "analysis" or the steps of how these frequencies were changed into "z-scores" and then how these were compared to determine attribution.

483faktorovich
mayo 24, 2022, 10:47 am

>479 Keeline: You have now entirely digressed into a loop where you are repeating "file-type, file-type, file-type..." I am getting off this loop to let you spin out of your own nonsense.

484faktorovich
mayo 24, 2022, 10:51 am

>480 prosfilaes: You have not answered any of my questions about how these rival method(s) actually move from the z-score to a comparison across different words and texts. No article I have looked at in this field has an answer for this question, as they all manipulate data, and manipulate the steps between counting frequency and comparing frequencies to determine attribution to fit their desired result, or skip these steps and just draw the results they want to see to reinforce current bylines.

485paradoxosalpha
mayo 24, 2022, 10:54 am

>481 faktorovich: Changing the filename does not change the file-type, nor does it change a corrupted file into a usable file.

So, you didn't try it then.

486amanda4242
mayo 24, 2022, 11:12 am

>482 faktorovich: Clearly not corrupted since I had no problem opening as a "File" type of file. And I did not send you any files. Keeline posted the file. We are not the same person. Again, careful reading will save you from making embarrassing mistakes. Or do you truly believe all these accounts are sock puppets and there is just one person behind them? Do you really think a conspiracy is more likely than you making a mistake?

487Keeline
mayo 24, 2022, 11:32 am

>481 faktorovich:

Changing the filename does not change the file-type, nor does it change a corrupted file into a usable file.

I was under the impression that you have used computers for years. It's hard to do so without learning some of the details.

Adding or changing a file name won't make a corrupt file workable or change its content. But it will help Windows associate it with a particular application. Note that other operating systems don't rely so heavily on extensions. The Mac does to a degree but not obsessively.

If you downloaded the new zip file I made especially for you ( https://Keeline.com/pics/TS-F.zip ), you will find the extensions on each file. I even went to the extra step to make sure the line endings are friendly to Windows applications.

I insist that you stop with the baseless charges. We could run a poll to how many people in the group could open the original zip file but it hardly seems necessary since several have stated and shown that they could open the files.

Do you abandon a road trip at the first construction sign? You seem to give up at the slightest obstacle. How did you finish a Ph.D. dissertation with this kind of temperament?

James

488Petroglyph
mayo 24, 2022, 7:18 pm

>486 amanda4242:

She's accusing you of having two versions of those files: a .txt file and a corrupted file; she takes your screenshots to be evidence of you opening the .txt version and pretending it's the "corrupted" version. She's convinced you're lying to her. Never mind that your screenshot clearly shows the lack of .txt extension in the Notepad title bar.

And look at how much work she demands you do before she'll treat you with respect! The goalposts have been moved so, so far in point 3 of >482 faktorovich:.

489amanda4242
mayo 24, 2022, 7:24 pm

>488 Petroglyph: Ridiculous, isn't it?

490Keeline
Editado: mayo 24, 2022, 7:45 pm

Tips about that "mysterious" .file extension that Windows adds to a file when it does see an extension. Apparently it has nothing like the Unix file command to quickly inspect the file and determine what it is.

https://file.org/extension/file

From my original zip file

% file *
TS01: ASCII text
TS02: ASCII text
TS03: ASCII text
TS04: ASCII text
TS05: ASCII text
TS06: ASCII text
TS07: UTF-8 Unicode text
TS08: ASCII text
TS09: ASCII text
TS10: ASCII text

From my new zip file

% file *
TS01.txt: ASCII text, with CRLF line terminators
TS02.txt: ASCII text, with CRLF line terminators
TS03.txt: ASCII text, with CRLF line terminators
TS04.txt: ASCII text, with CRLF line terminators
TS05.txt: ASCII text, with CRLF line terminators
TS06.txt: ASCII text, with CRLF line terminators
TS07.txt: UTF-8 Unicode text, with CRLF line terminators
TS08.txt: ASCII text, with CRLF line terminators
TS09.txt: ASCII text, with CRLF line terminators
TS10.txt: ASCII text, with CRLF line terminators

Notice how it can detect how the lines are ended (Windows style with CRLF).

It will even tell more detailed information about an image:

% file Tom_Swift-spines-56,55,57-dots.png 
Tom_Swift-spines-56,55,57-dots.png: PNG image data, 1422 x 2909, 8-bit/color RGBA, non-interlaced

James

491faktorovich
mayo 24, 2022, 8:38 pm

>486 amanda4242: If you can imagine that you are all "sock puppets and there is just one person behind" all of you; then, there must be some truth behind this imaginary scenario, or why would you mention it? I did not say that you were all "sock puppets", since I have not tested your linguistic styles, and thus have not come to a quantitative conclusion on this hypothesis of yours.

492faktorovich
mayo 24, 2022, 8:43 pm

Este mensaje ha sido denunciado por varios usuarios por lo que no se muestra públicamente. (mostrar)

>487 Keeline: By showing the screenshots of what is in the files in question, you have already proven they have nothing relevant to the data/analysis that I asked for, as I already have the word-frequency data, or can access it through standard counters. And your repeated insistence that I keep opening these files suggests you are likely to have corrupted them with a virus or the like, and not simply created un-openable files. So I will not keep downloading them, as this is an entirely pointless and potentially harmful exercise. It is amazing that you can keep repeating what you have already said about these files an unknown quantity of times. I look forward to finding out if you can post a zillion posts that all repeat the same point.

493paradoxosalpha
Editado: mayo 24, 2022, 9:16 pm

>491 faktorovich: there must be some truth behind this imaginary scenario, or why would you mention it?

Er, because you have been behaving as if it were true long before it was jokingly entertained by anyone else on this thread.

It's strange; I don't need quantitative stylometry to tell apart most of the posters in the thread--often I don't even need to check the handle in the post header. Of course, I've seen a lot of their posts in LT Talk for many years. You, on the other hand, have repeatedly confused the identities of interlocutors here. Your great intellectual achievement is supposedly the attribution of texts to their authors hundreds of years ago, and you can't seem to get it straight for clearly-labeled, honest contemporaries.

494Aquila
mayo 24, 2022, 9:25 pm

>492 faktorovich: Wow, your assumption that people here are acting in bad faith is impressive. No one is trying to send you viruses. You don't know much about dealing with different file types, so you are looking for excuses to not deal with the files at all, instead of using this as a chance to learn a bit more about it. That's a nasty and self-sabotaging coping mechanism you have there.

495Matke
mayo 24, 2022, 9:30 pm

>492 faktorovich:

It’s amazing that you can make extraordinary claims and accusations about posters here.

The merest cursory effort would reveal that most of us have well-established accounts, content, and writing styles that are widely diverse. I had not heard of or come across many of these posters until this thread.

Please stop making accusations regarding dishonesty, even if it’s intellectual dishonesty, which you can’t possibly back up.

Disagreement with your positions, methodology, research, use of statistics, or your lack of willingness to entertain the (gasp!) notion that you may be even occasionally wrong does not constitute a conspiracy.

It’s simply disagreeing with you, based on posters’ own independent research and knowledge in various fields, most notably statistics and use of computers.

Most adults can accept that others disagree with them without accusations of conspiracy, dishonesty, and incompetence.
That you cannot do so speaks very poorly for your open-mindedness, your academic courtesy, and your ability in debate.

496Petroglyph
mayo 24, 2022, 10:23 pm

>489 amanda4242:

Oh, you were being nice. Sorry for petroglyphsplaining!

Yes -- this attitude is impossible to take seriously.

497Keeline
mayo 24, 2022, 10:32 pm

Sure. We're all sock puppets for one person who has it in for you. Me, a small part of my 9,000+ book library, and Whistler. Photo by by wife.

Wouldn't it be rather extraordinary for the participants in these groups to have wildly disparate libraries cataloged on the site with accounts stretching back a decade or significantly more.

The zip files have two folders with a bunch of plain text files. You're the one who's looking to download .docx and .exe files which have been shown to carry malware in the past. Text files do not.

James

498Petroglyph
Editado: mayo 25, 2022, 12:21 am

For what it's worth, I've now posted my review of Faktorovich's Re-attribution. You can find it here.

It's my most Faktorovian work to date: it took the bare minimum of effort, and it's a review only by the merest of technicalities. In other words: precisely like the work it's a review of.

A proper review of considerable length is in the works, consisting of mostly reworked bits I posted in this thread. I'll post it somewhere else, and cross-post it here.

499librorumamans
mayo 24, 2022, 11:56 pm

>462 faktorovich: That is an un-openable set of files that are labeled by the broken term ".file", instead of .exe or .doc. So you have sent a file that is corrupted or both corrupted and deliberately designed not to open.

Unbelievable!! In 2022 someone who claims to be competent to use a computer to do statistical analysis is unable to handle a .zip of unicode text files. Of all the nonsense that this thread has engaged with, for me this (notice how easily I rendered italics again?) tiny admission strips away any shred of credibility.

Behold! Immanuel Velikovsky is reborn.

500amanda4242
mayo 25, 2022, 12:27 am

>496 Petroglyph: Forgiven. :)

I don't know why I bother to interact with her since she keeps mistaking me for other people. Am I that forgettable?

501amanda4242
Editado: mayo 25, 2022, 12:45 am

>499 librorumamans: for me this...tiny admission strips away any shred of credibility.

You must have missed the insects are not animals discussion on the last thread if this is where you think she lost her credibility.

502librorumamans
mayo 25, 2022, 12:47 am

>501 amanda4242:

And I'm grateful that I did.

503Stevil2001
mayo 25, 2022, 7:32 am

>498 Petroglyph: Bravo.

504abbottthomas
mayo 25, 2022, 7:53 am

>497 Keeline: Sure. We're all sock puppets for one person who has it in for you........

Wouldn't it be rather extraordinary for the participants in these groups to have wildly disparate libraries cataloged on the site with accounts stretching back a decade or significantly more.

More extraordinary, do you think, than the delusion that Verstegen, Harvey, Byrd et al. wrote all those books? Dr. F seems very comfortable with this world view.

The technical discussion about the maths and the computing has escaped me long ago but I can't lose my conviction that, no matter whether her figures are right or wrong, using them to justify the six ghost writer theory takes us much too far from reality.

505bnielsen
mayo 25, 2022, 9:03 am

>487 Keeline: Your pet has white socks too. A coincidence? Surely not!

506anglemark
mayo 25, 2022, 10:28 am

>497 Keeline: Our library is also 9,000+ books! Coincidence? I should think not!

-j

507AbigailAdams26
mayo 25, 2022, 11:44 am

Hello All. This is just a reminder that our Terms of Service prohibit name-calling of any kind. A number of participants in this thread have recently violated the TOS, and have been contacted privately.

Please review the TOS (https://www.librarything.com/privacy), if you are unsure of what is permitted, and be aware that repeated violations will result in suspension or removal.

LibraryThing does not review every post in detail. If you have concerns about a specific post, please flag it, or contact staff directly at: info@librarything.com.

508Petroglyph
mayo 25, 2022, 12:03 pm

>500 amanda4242:

Thanks! Turning off the didactic mode is hard sometimes.

If it's any consolation, I doubt you are the only one she doesn't think of as an independent individual.

509faktorovich
mayo 25, 2022, 12:51 pm

I am not going to engage in repetitive insult-comedy without any scientific or literary value that the last few dozen posts have been. If anybody has a question or comments about BRRAM, my attribution method, or "Who Really Wrote the Works of the British Renaissance?" I will respond, as previously.

"He will steal, Sir, an egg out of a cloister... he will lie, Sir, with such volubility, that you would think truth were a fool..." --Jonson's "Shakespeare"-bylined comedy "All’s Well That Ends Well"

510paradoxosalpha
mayo 25, 2022, 1:06 pm

>509 faktorovich: I am not going to engage in repetitive insult-comedy

Why stop now?

511Petroglyph
Editado: mayo 25, 2022, 1:14 pm

>497 Keeline:
>499 librorumamans:
>501 amanda4242:
>504 abbottthomas:

May I offer a few more quotes from the book as additional illustrations? Though I doubt any are needed now.

Most of Re-attribution is paranoid historical fiction -- a conspiratorial rewriting of history through this one very narrow six-ghostwriters-lens. And so, peppered throughout normal-sounding but badly written biographical material, Faktorovich throws sudden leaps of logic and unwarranted assumptions of forgery at the reader as though any of it were convincing to other people. Examples, with boldface added:

Descriptions includes intricate engravings by Giovanni Battista Cavallieri; given the cost of original engraving, Verstegan must have either performed the engravings himself under the “Cavallieri’s”-byline, or he had received a large sum in funding from some source by this point (p. 251)

Verstegan supported the English Mission not only as a publisher, but also as an “arranger of passports” for priests and their servants. This evidence confirms that Verstegan was officially employed as a forger and propagandist. Thus, it is a short step from this accepted history to conclude that Verstegan also forged “Shakespeare’s” signatures and simultaneously propagated for the opposition. {...} records survive of Verstegan registering several books for license by submitting them to the royal council. In at least one of these applications, Verstegan supplements his own name with another printer’s name (“Arnout Conincx”) on the title page of Primer. And Verstegan applied for a license for Speculum in 1590 under his own name, but listed “Plantin” as the printer. Given these two examples of acknowledged and deliberate mis-attributions to real or imagined and cooperating or ignorant of their claimed participation printers, it is reasonable to assume that Verstegan operated under other printer-pseudonyms. It is possible that Verstegan operated under the names of at least some of the printers claiming to be based in London or Edinburgh across the 284 tested texts. (p. 252)

(I'm not sure if she believes the printer Plantin is imagined or merely ignorant, but the building and the presses still exist, and the museum is well worth a visit.)

According to Faktorovich, the letters of Queen Elizabeth I, as well as her speeches, could not have been written/performed when historians think they were, because the members of this ghostwriting troupe she's concocted were too young to have written them.

According to the linguistic tests, Elizabeth’s published speeches (dated with years between 1559 and 1585) predominantly matched Verstegan’s signature, with secondary assistance from Jonson. In contrast, her letters (dated between 1572 and 1580) were ghostwritten by Harvey as a primary and Verstegan as a secondary. In 1559, Verstegan was around 9, Harvey was 7, and Jonson had not yet been born. While it is possible that Verstegan and Harvey were extraordinary children and began writing at these miraculous ages, the likelier explanation is that these speeches and letters were composed and edited at a much later date. The earliest year in which one of the texts in the Jonson-group is claimed to have been first-performed is 1590 (“Shakespeare’s” Two Gentlemen of Verona), when Jonson was 18; thus, the earliest date when the final tested version of the speeches could have been written is 1590. The evidence suggests that Elizabeth I either did not issue any speeches or letters prior to the commissioning of the Ghostwriting Workshop, or that she previously issued short and direct statements with her orders, instead of issuing the prolonged orations recorded in these published versions. (pp. 254-255)

(I apologize if I'm belabouring the point. But there's just so much bullshit in this book that I want to vent a little. And "bullshit" is the technical term.)

512Petroglyph
mayo 25, 2022, 1:12 pm

>510 paradoxosalpha:

Wouldn't it be fitting if any and all future posts in this thread consisted of "I am not going to engage in repetitive insult-comedy" only?

513amanda4242
Editado: mayo 25, 2022, 1:57 pm

>511 Petroglyph: quoting Faktorovich: the likelier explanation is that these speeches and letters were composed and edited at a much later date

Well, that's one explanation. That Verstegan and Harvey didn't write them is the most likely.

514abbottthomas
mayo 25, 2022, 1:54 pm

>513 amanda4242: I think you’ll find Petroglyph is quoting from the Faktorovich work.

515amanda4242
mayo 25, 2022, 1:57 pm

>514 abbottthomas: I know. I'll edit my post to make it clearer.

516librorumamans
mayo 25, 2022, 3:03 pm

>458 Matke: you’ve made the mistake of rounding to the nearest hundredth. This gives a false result when you perform further computations.

There's an xkcd for that:

517bnielsen
mayo 25, 2022, 3:09 pm

FWIW I don't understand why >492 faktorovich: was flagged.

518paradoxosalpha
Editado: mayo 25, 2022, 3:14 pm

>517 bnielsen:

I didn't flag it myself, but I could see why people did. She seemed to be sincerely accusing James of trying to infect her machine with malware.

519amanda4242
mayo 25, 2022, 3:18 pm

>517 bnielsen: I didn't flag it myself, but I can understand why others might consider her baseless accusation that Keeline is trying to send her viruses an attack.

520Stevil2001
mayo 25, 2022, 3:28 pm

>519 amanda4242: Yes, I saw it as falling under the category of "personal attack."

521Petroglyph
mayo 25, 2022, 4:26 pm

>517 bnielsen:

I didn't flag it either, but I can totally see why people did. Keeline hasn't even been all that testy with her.

522Petroglyph
mayo 25, 2022, 4:47 pm

>513 amanda4242:

All these prefaces of "it is reasonable to assume"; "the likelier explanation is"; "it is possible that"; "it is a short step", "it must have"... It's like conspiracy theorists all use the same playbook. Other pseudoscience peddlers use the same tactic: Erich von Däniken defends his Ancient Aliens pseudoscience by claiming that "it's possible" a lot. So does Giorgio Tsoukalos (the ""aliens guy"): There's the classic "Is such a thing even possible? Yes it is!", or even this piece of preaching to the choir: "Now this sounds absolutely insane, and I'm aware of how this sounds! But, if you look into the whole story, along with all the other supporting accounts, then, if you are of the same mindset, you could potentially come to the same conclusion."

I'm hesitant to read much more in this book -- what if some of its bullshit enters my long-term memory and becomes part of how I think about Early Modern England?

523paradoxosalpha
Editado: mayo 25, 2022, 5:03 pm

Yes, I love that rhetorical gambit. "Is it possible that extraterrestrials put acid in my coffee this morning? The answer is yes." Logically flawless for nearly any dependent clause.

524Keeline
mayo 25, 2022, 5:31 pm

For what it is worth, I didn't flag it either.

James

525Keeline
mayo 25, 2022, 7:07 pm

I think I recall that there is a monitor that alerts the LT staff when a reply gets multiple flags in quick succession so they can jump in and see what is going on and whether they need to take action. That or someone making a report to LT staff probably reminded them of the thread for review.

James

526faktorovich
mayo 25, 2022, 8:55 pm

>511 Petroglyph: When you quote history-changing sections out of a 698-page book, it is always necessary to include a summary of the surrounding section or chapter to explain to ground readers. And when you are claiming something is "fictitious" you have to explain what about the paragraph you are quoting is "fictitious" or why you have come to this conclusion. Merely turning a few words bold does not prove or even state anything, other than indicating you can use the bold function. You also abbreviate (...) sections that are essential to the explanation, or start sections in the middle of a thought, when the preceding paragraph (if not the various other evidence given across the rest of the book, if not the rest of the chapter) is necessary to grasp why the conclusion you are quoting from has been arrived at. And the sections you are not highlighting but including in these quotes explain partially the conclusions raised in the bold sections, as is the case with this line: "And Verstegan applied for a license for Speculum in 1590 under his own name, but listed 'Plantin' as the printer." I found a lot more evidence to confirm the claim that Verstegan operated under multiple printer-bylines or printer-pseudonyms while translating "Restitution" and some of the other texts in the series. If you keep reading beyond these fragments you should start to be convinced by the preponderance of evidence.

There is no contradiction between the existence of the Plantin Press building and Verstegan's contribution to designing or organizing this print-shop; he would not be doing the busy-work of pressing pages and ink together, but rather would have trained apprentices (one of whom was probably called Plantin, and who eventually matured from an apprentice to the lead printer) on menial tasks.

You are not including several points regarding Elizabeth I's letters/ speeches that prove the case, such as that the first of her speeches was published near the time of her death, so there is no registered proof that any of these were written prior to that point (though some letters etc. probably were written earlier, but not long before the Workshop's registered publishing activities started, as their linguistic-signatures match "Elizabeth's" letters and speeches).

The term "bullshit" would be accurately applied to the current version of the history of Britain pre-1650; I am correcting this "bullshit" was factual evidence that contradicts this propagandistic, legendary fiction.

527amanda4242
mayo 25, 2022, 9:06 pm

>525 Keeline: From the TOS: "LibraryThing staff do not review all flags. Severe abuse should be reported to LibraryThing staff."

528faktorovich
mayo 25, 2022, 9:11 pm

>522 Petroglyph: The use of phrases such as "it is reasonable to assume" is never found in absurd conspiracy theories about "Ancient Aliens". To check this hypothesis, I searched for "it's possible" in Erich von Daniken's "Evidence of the Gods: A Visual Tour of Alien Influence in the Ancient World" (2012); there were no instances of the phrase "it's possible" occurring across this book; the closest example I found was: "The South Pacific Islands are full of similar legends, and it is always possible to find a connection with the traditions in other parts of the world." The differences in Daniken's approach include: 1. using extremely vague references and generalizations ("always possible"), and 2. referring to vague, unnamed sources and concepts. In contrast, even within the brief quotes you mentioned, I provide evidence that has led me to reasonably "assume" the narrow stated conclusion. There is nothing fantastical or otherworldly about my conclusions; the absurd fantasy is the current belief that the legends of King Arthur and the "Anglo-Saxon" Hengist and Horsa are historical. If reading my book is making you concerned you are going to remember my version of the actual British history; then, you are beginning to approach reality, and you clearly do not like what you are seeing.

529faktorovich
mayo 25, 2022, 9:17 pm

>523 paradoxosalpha: It would be anti-scientific to refer to any historical conclusion reached without access to a direct confession as anything but a "possible" or a likely answer. Similarly, scientists say there is a 98% or a very likely possibility an asteroid would hit Earth; scientists do not state such a conclusion as a fact until the asteroid has already hit Earth in the past. The events I am describing happened in the past, but the Workshop did their best to avoid being sued on fraud charges, while hinting at their function as ghostwriters, and at their use of pseudonyms. They occasionally do make direct confessions, and there are some documents that prove my assertions as a fact, such as: "And Verstegan applied for a license for Speculum in 1590 under his own name, but listed 'Plantin' as the printer." It is a fact that he used his own and Plantin's names interchangeably for this single title. I am simply being very careful to avoid any leaps in my conclusion when I use terms like "possible" even when it is "impossible" for any explanation to be applied but the one I am proposing.

530faktorovich
mayo 25, 2022, 9:20 pm

>525 Keeline: The cyclical attempts to censor my freedom to share my research via flags etc. in this thread is just one additional piece of proof that my research is such that it has all of you tempted to ban the First Amendment to silence me.

531Matke
Editado: mayo 25, 2022, 9:55 pm

>530 faktorovich:

1. You do not understand the First Amendment. That amendment applies only to government(s) within the US, to include the federal government itself, which are forbidden to attempt to silence free speech. No one here, nor Library Thing itself, is part of or representative of any government.

Privately-owned organizations and private citizens do have a right to *not provide a forum* for objectionable speech.

2. Now that the First Amendment is clear to you, please look back at any of your posts that have been flagged. None of them has been flagged for your “research” or for the conclusions you have reached. Posts have been flagged for your accusations of posters attempting to infect your computer with malware, and for accusing posters of dishonesty.

Your words:
my research is such that it has all of you tempted to ban the First Amendment to silence me.

What? Ban the first amendment? Now there is a statement that is nonsensical.

“All of you”? How many posters have there been on these two threads? How many flags were on the posts in question? That’s not “all”, factorovich. Not even close. Further, you have absolutely no way to know who is flagging the posts in question. The flaggers may be members who read the threads but haven’t participated in the discussions.

And there, my dear factorovich, is an example of exactly what concerns many of us here: your reaching conclusions from inadequate information, and your making statements that are contrary to facts.

And you just keep doing it, over and over and over..

532prosfilaes
mayo 25, 2022, 11:58 pm

>504 abbottthomas: The technical discussion about the maths and the computing has escaped me long ago but I can't lose my conviction that, no matter whether her figures are right or wrong, using them to justify the six ghost writer theory takes us much too far from reality.

Yeah. I'm pretty sure you could drop the 285 sample texts into Stylo and interpret the tree that comes out in a way that gives you the six ghostwriter theory. Tossing an entire (early) Gutenberg CD into Stylo, well, it's obvious that Vergestan couldn't have written the KJV, as it was written by the same author as the Book of Mormon and the Wizard of Oz, so it obviously dates much later.

533andyl
mayo 26, 2022, 3:43 am

>530 faktorovich:

What Matke said.

But also your posts that were flagged were accusations that Keeline had embedded a virus in a zip of text files or deliberately sent you corrupted files. That wasn't (and isn't) part of your research. I don't think (although I haven't checked) posts which directly relay some of your 'research' have been flagged (at least not enough to hide them) however risible people think they are. So don't try and use the "people are suppressing my research" line - it is obviously untrue.

Finally, flagging merely hides the objectionable content. Your messages are still there, still readable. Flagging is the community, and btw my guess is that the flags mostly come from those who are not active participants in the discussion, saying that you have crossed over a line with a couple of posts.

534Petroglyph
mayo 26, 2022, 4:12 am

>528 faktorovich:

I searched for "it's possible"

Did you search for "it is possible" or "possibly" or "could be" or "could have been" or "conceivable"? Probably not. Why would you possibly put in some good-faith effort?

The differences in Daniken's approach include

I assume you mean differences between Von Däniken and yourself. I think you overestimate the distance between the two of you.

Within anti-stratfordians, Baconians and Oxfordians aren't convinced by each other's ideas, and try and show the other is wrong. Group theorists disagree with both. Evangelicals don't think catholics are properly christian; catholics think evangelicals are embarrassingly literalist and have completely missed the point of being christian.

None of these kook-internal or cult-internal debates are of much theoretical significance to outsiders. You're an anti-stratfordian, Faktorovich. The precise flavour does not matter -- Looney, Freud, Emmerich, Faktorovich: they're all bullshit. In fact, you go beyond this mere anti-stratfordianism and go on to reimagine so much more of history -- all rooted in your homegrown little ghost-writing obsession. Being a pseudohistorical conspiracy theorist is what makes you deserving of being classed together with other pseudohistorical conspiracy theorists, such as Von Däniken and Tsoukalos. Because that is what you are.

I provide evidence that has led me to reasonably "assume" the narrow stated conclusion

Your idiosyncratic little booklets in which you rewrite history based on this one little idea that you believe in very strongly (6 ghostwriters penning over eighty years' worth of letters) are just as convincing as Von Däniken's history-changing works that contain many lists of evidence and that are based on this one little idea that he believes really strongly in (Ancient Aliens mating with humans and producing homo sapiens).

King Arthur

You're a little ahead there: I don't think you've mentioned any arthuriana yet. (Oh frak. I've just extended this thread by another 150 messages, haven't I?)

535Petroglyph
mayo 26, 2022, 4:15 am

>528 faktorovich:

I've gathered a bunch of quotes from two books by Von Däniken I happened to have handy. In each of these, he tries to make some pseudo-historical nonsense seem more acceptable by painting it as something that's possible, conceivable, that could have happened.

The parallels with your "it is reasonable" and "it must have" and "it is likely that " etc. etc. are obvious.

Any boldface in the following quotes are my own added emphasis. The translations from German are my own.

Von Däniken, Erich. 2008. Falsch informiert! vom unmöglichsten Buch der Welt, Henochs Zaubergärten und einer verborgenen Bibliothek aus Metall. 2. Aufl. Rottenburg: Kopp.
(“Misinformed! About the most impossible book in the world, Henochs magical gardens and a hidden library in metal”)

Selbst wenn also die Pergamente und die Tinte des Voynich-Manuskriptes erst 200 Jahre alt wären, weiß man immer noch nicht, wie alt der ursprüngliche Inhalt tatsächlich ist. Es könnte ja sein, dass sich nach einer gelungenen Entzifferung plötzlich Abgründe in ein vergangenes Wissen auftun, welches die Welt verändert
Even if the parchment and the inks of the Voynich manuscript were only 200 years old, we still wouldn't know how old the contents in actuality are. It could be possible that a successful decryption would suddenly open up a chasm into a knowledge of the past, which would change the world.

Ich dachte an die Mormonen und Henoch, sagte mir immer wieder: Unmöglich ist das alles nicht. Irgendwo auf diesem Globus lagen uralte Überlieferungen, weshalb nicht in Ecuador? (p. 131)

I recalled the Mormons and Henoch, kept repeating to myself "None of this is impossible. Somewhere on this globe were ancient traditions, why not in Ecuador?"

Henochs Bücher werden versteckt. Möglicherweise sind einige in der Großen Pyramide, andere - oder Abschriften - gelangen zu Laban irgendwo im Räume Jerusalem. (p. 182)

Henoch's books were hidden away. It is possible that a few have ended up in the Great Pyramid, and others -- or copies -- have ended up with Laban somewhere around Jerusalem.

Zur möglichen Lösung der unzähligen Widersprüche muss ich eintauchen in eine fantastische Realität (p. 188)

For a possible solution to the innuberable contradictions I must dive into a fantastical reality {The next paragraphs delve into this fantastical scenario: aliens might have ruled earth like gods thousands of years ago; they mated with humans; floods were caused by H-bombs or a meteorite; this brought about "god wars" which are recorded in myths all around the globe. Etc etc.}

{Context: Von Däniken talks about how religious traditions and the constant supernatural threat of punishment manage to transmit the same message more or less intact for thousands of years; religions could have been started in order to preserve ancient messages from the god-aliens}
Nicht-Wissen, Angst und Priesterschaft machen das Undenkbare möglich. Seit Jahrtausenden. Wir können alle falsch liegen, was den Ursprung der Religionen betrifft.(p. 192)
Absence of knowledge, fear and a priesthood make the unthinkable possible, and they have for thousands of years. It's possible we're all mistaken concerning the origins of religion

Von Däniken, Erich. 1973. Chariots of the Gods? Unsolved Mysteries of the Past. New York: Putnam
(Originally published in German in 1968.)

The space-ship would be as big as a present-day ocean liner and would therefore have a launching weight of about 100,000 tons with a fuel load of 99,800 tons, i.e. an effective pay load of less than 200 tons.
Impossible? Already we could assemble a space-ship piece by piece while in orbit round a planet. (p. 7)

In that case, what purpose did the lines at Nazca serve? According to my way of thinking they could have been laid out on their gigantic scale by working from a model and using a system of co-ordinates or they could also have been built according to instructions from an aircraft. (p. 17)

Besides being a first-hand report, the Epic of Gilgamesh also contains descriptions of extraordinary things that could not have been made up by any intelligence living at the time the tablets were written, any more than they could have been devised by the translators and copyists who manhandled the epic over the centuries. For there are facts buried among the descriptions that must have been known to the author of the Epic of Gilgamesh, if we look at them in the light of present-day knowledge.
Perhaps asking some new questions may throw a little light on the darkness. Is it possible that the Epic of Gilgamesh did not originate in the ancient East at all, but in the Tiahuanaco region? Is it conceivable that descendants of Gilgamesh came from South America and brought the Epic with them? (p. 49)

I have already mentioned the physically impossible ages of the Sumerian kings and the biblical figures. I asked whether these people could not have been space travellers who prolonged their life-span through the effect of the time shift on interstellar flights just below the speed of light.(p. 84)

Von Däniken’s drivel is heavy on suggestive rhetorical questions, intimations that things are possible, or “not impossible”, things that "could have been" the case, and liberal use of the German verb können “to be possible” and the adverb möglicherweise “possibly.” So many rhetorical questions!

536Petroglyph
mayo 26, 2022, 4:18 am

>530 faktorovich:

LOL

537Aquila
mayo 26, 2022, 4:20 am

>535 Petroglyph: Are you producing incontrovertible proof that Von Däniken has actually written everything signed by Faktorovich?

538Petroglyph
mayo 26, 2022, 5:26 am

>537 Aquila:

If we could show that both Chariots of the gods? and Re-attribution had a similar number for characters per word or lexical density, or if they shared a three-word phrase, then that ought to be enough evidence to convince at least one person in this thread!

It's an interesting question. Could Von Däniken have written Faktorovich's books?

Von Däniken imagines himself as someone who dares to think differently and who makes revolutionary claims. The first paragraph of the introduction to his "Chariots of the gods?" reads: “It took courage to write this book, and it will take courage to read it. Because its theories and proofs do not fit into the mosaic of traditional archaeology, constructed so laboriously and firmly cemented down, scholars will call it nonsense and put it on the Index of those books which are better left unmentioned. Laymen will withdraw into the snail shell of their familiar world when faced with the probability that finding out about our past will be even more mysterious and adventurous than finding out about the future.” He claims that his assertion of Ancient Aliens is “revolutionary. It shatters the base on which a mental edifice that seemed to be so perfect was constructed” (p. viii).

Faktorovich has similar delusions of grandeur. On page 1 msg #282 of this thread, she said "You are saying that the history of Britain is my burden alone. Not you, or any other scholar, but me specifically. I agree." And just like Von Däniken anticipates being placed on some Index Librorum Prohibitorum, Faktorovich writes in msg #530 "The cyclical attempts to censor my freedom to share my research via flags etc. in this thread is just one additional piece of proof that my research is such that it has all of you tempted to ban the First Amendment to silence me."

(Also: search Page 1 of this thread for "censor").

More parallels? Here are a few. (All emphases added by me.)

Here is Von Däniken:

Since we are not prepared to accept or admit that there was a higher culture or an equally perfect technology before our own, all that is left is the hypothesis of a visit from space! As long as archaeology is conducted as it has been so far, we shall never have a chance to discover whether our dim past was really dim and not perhaps quite enlightened. (p. 28)

Compare this to Faktorovich’s positioning of her claims against those of the stodgy old mainstream:

In their conclusion, they state there are only three possibilities presented by their evidence: 1. “Shakespeare” wrote Arden entirely but textual corruptions led to some inconsistent results. 2. “Shakespeare” collaborated with another writer. 3. Both of these are true. Given these choices, the only conclusion they can exclude is “that Shakespeare had no hand in Arden”, and thus, it must be a part of “the canon of his works”. This reasoning is grounded in the fallacy of exclusion. They exclude the possibility that the evidence indicates as the true option: “Shakespeare” did not write some or all of the texts currently attributed to him. (p. 557)

Giorgio Tsoukalos:

Now this sounds absolutely insane, and I’m aware of how this sounds! But, if you look into the whole story, along with all the other supporting accounts, then, if you are of the same mindset, you could potentially come to the same conclusion.

Faktorovich:

The claim that six ghostwriters penned the British Renaissance is intuitively going to appear preposterous to all readers of this introduction. Only those who continue reading the evidence presented in the body of this book can find the tools to fight against the ingrained myths associated with figures such as “William Shakespeare” and “Queen Elizabeth I”. (Re-attribution, p. 10)

Von Däniken:

If we work on the hypothesis that the Epic of Gilgamesh came to Egypt from the Sumerians by way of the Assyrians and Babylonians, and that the young Moses found it there and adapted it for his own ends, then the Sumerian story of the Flood, and not the biblical one, would be the genuine account.
Ought we not to ask such questions? It seems to me that the classical method of research into antiquity has got bogged down and so cannot come to the right unassailable kind of conclusions. (p. 50)

Faktorovich:

The impassable problem with their analysis is in their belief in the mythological biography of “William Shakespeare”, which prevents them from deriving scientifically sound deductions based on their own generated data-sets. (p. 557)

Von Däniken:

{the classical method of research into antiquity} is far too attached to its stereotyped pattern of thought and leaves no scope for the imaginative ideas and speculations which alone could produce a creative impulse. (p. 50)

Faktorovich:

For example, Horton fails to anticipate that Henry VIII might have three hands in it. Horton also does not open his imagination to the possibility there is no single “Shakespeare”, but rather four ghostwriters composing under this pseudonym. (p. 632)

Von Däniken:

This assertion {of Ancient Aliens} is revolutionary. It shatters the base on which a mental edifice that seemed to be so perfect was constructed. It is my aim to try to provide proof of this assertion. (p. viii)

Faktorovich:

The numbers unshakably determine that there were only six ghostwriters who wrote the British Renaissance, and the rest of this study will prove this history-changing conclusion. (p. 7)

Or even >420 faktorovich:. "You are afraid of my interpretation of history based on evidence that contradict the mythological history that you have come to believe to be truthful"

Until we perform a certainty-generating stylometric test of these authors, we cannot reject the hypothesis that there is a single ghostwriter at work here.

And then, we'd have to involve many more authors, too. We cannot just assume that the bodies of work traditionally attributed to Erich Von Däniken and Faktorovich are, in fact, by these bylines. We'd have to test more texts by many more bylines to hunt for similarities.

We'd have to start looking into financials and read the biography of these bylined names in the light of ghostwriting. We'd have to ask questions like, why would Faktorovich pay someone to write books for her? She's on record as having ghost-written for someone, so that only confirms her experience with the practice.

Clearly, Faktorovich is excluded as Von Däniken's ghostwriter: she was not even born when his first book came out. Or perhaps it is reasonable to assume that that book was forged and back-dated! Maybe Chariots of the gods? was written by Faktorovich!

Now, of course, it is possible that cranks and pseudoscientists resort to similar ways to make their kooky poppycock palatable to a larger audience. But the likelier explanation is ghostwriting.

539librorumamans
Editado: mayo 26, 2022, 12:17 pm

>528 faktorovich: I provide evidence that has led me to reasonably "assume" the narrow stated conclusion

Unless in my decrepitude I have lost the thread of logic, one does not assume conclusions, even within quotation marks.
ETA: In the olden days I believe that was a tautology.

540Petroglyph
mayo 26, 2022, 12:49 pm

>528 faktorovich:

The use of phrases such as "it is reasonable to assume" is never found in absurd conspiracy theories about "Ancient Aliens".

See, the trouble with such absolutist statements is that it only takes a single counterexample to disprove them.

Are Aliens to blame for the extinction of dinosaurs?

{...}
All these factors led to the assumption that the true cause of the extinction of the dinosaurs is some advanced race from outer space. Moreover, this may not be the first “purge” that aliens conduct on Earth. And judging by the cycles of extinction, it can even be the fifth!

The reason aliens could be guided by destroying dinosaurs on the planet is because the animals posed a threat to their new human project. Aliens could prepare territories for the evolution and development of mankind in order to subsequently be able to provide a safe and reliable world for life.

It is reasonable to assume that the idea of a human farm created by aliens is visible in this. However, in order to dominate the planet and subsequently give us the opportunity to live, it was necessary to replace the dinosaurs with another species.

541faktorovich
mayo 26, 2022, 2:44 pm

Este mensaje ha sido denunciado por varios usuarios por lo que no se muestra públicamente. (mostrar)

>531 Matke: It is good that you are vigilantly aware of the First Amendment law. But I stated "all of you tempted to ban the First Amendment"; I did not say that I believed the First Amendment has any power to give me free speech on a private business's website, or if banning it would allow you to ban my free speech. It was meant to be a joke, but by objecting to Free Speech, you have turned it into a tragedy.

I have only stated facts in all of the posts that have been flagged. 1. The files in question were corrupted and could not be opened; and this gave me reasonable cause to state that after opening the files in question twice to check if there was any way to open them, I would not open a new version out of a genuine fear this time a virus might be attached. Flagging a cautious internet user for refusing to open a suspicious file is like ticketing a woman who refuses to be alone with a man who has been previously accused of sexual assault by a dozen women. 2. And I have proven the presence of blatant academic fraud by proving data, conclusions and other aspects of research were manipulated and undisclosed in each instance where I have made accusations of such fraud. Whistle-blowing against academic fraud in research is a protected right. Here is an example of how it is protected in "Whistleblower's Bill of Rights - APPENDIX A" https://ori.hhs.gov/whistleblowers-bill-rights-appendix#:~:text=Whistleblowers%20have%20a%20responsibility%20to%20act%20within%20legitimate%20institutional%20channels,e. : "Whistleblowers have a responsibility to act within legitimate institutional channels when raising concerns about the integrity of research. They have the right to raise objections concerning the possible partiality of those selected to review their concerns without incurring retaliation." I have a "responsibility" and not merely a "right" to point out academic fraud wherever I see it. Anybody who flags or attempts to censor my thoroughly researched findings of the existence of academic fraud is committing "retaliation" against me, which is prohibited by this "Bill of Rights", as well as various other laws/ regulations of this type. These rights and responsibilities are even more relevant when communication takes place on a private business's forum, as businesses have an obligation to protect whistleblowers from retaliation.

542faktorovich
mayo 26, 2022, 2:46 pm

>532 prosfilaes: If neither you nor anybody else has attempted testing the 284 texts with Style and providing the resulting data and analysis; then, there is no point in speculating what the result would have been.

543rosalita
mayo 26, 2022, 2:59 pm

>541 faktorovich: Are you truly trying to say that you think sending a computer virus is the same as sexual assault? Extraordinary.

Also, "gave me reasonable cause to state" does not equal "stating a fact."

544faktorovich
mayo 26, 2022, 3:03 pm

>534 Petroglyph: I searched for the exact phrase Petroglyph cited, "it's possible", as being frequently repeated in the exact source he cited. If he meant to say that the general idea of approximate-knowledge is common in this source, he should not have insisted this specific phrase occurs frequently in it. When he stated that this phrase repeats frequently, this allowed me to test this specific assertion to prove that it is not true. This is how research works. Petroglyph used this phrase because it is one he uses frequently conversationally, as is the case in his post 338 "I included that section to show that a) it's possible, and b) it's easy." In objecting to the manner in which I phrase my findings, Petroglyph specifically stated: "Erich von Däniken defends his Ancient Aliens pseudoscience by claiming that 'it's possible' a lot." Because Petroglyph did not use softer language, but rather insisted that it is a fact that Daniken uses this specific phrase "a lot"; it was possible for me to check the frequency of this phrase's use to prove that this statement was false. In contrast, my use of softer phrases that avoid complete certainty make it extremely difficult for any researcher to prove that any such statements are false, as some degree of likelihood is guaranteed by the evidence that I present in support of each of my conclusions.

You are absolutely wrong in claiming I fit with any of the "conspiracy theories" you cite. One simple way of disproving your conclusion is to argue against the claim that I have written "little booklets"; each of the 17+ books so far in BRRAM is over 112 pages, with one at 698 7X10" pages, and "Restitution's" original edition was over 380 pages. Here is a source that defines a "booklet" as something with under 20 pages: https://publuu.com/knowledge-base/what-is-a-booklet/#:~:text=The%20term%20itself.... So, if you are wrong about the size of the books in this series, it is still less likely you can be right about what is inside these books.

545faktorovich
mayo 26, 2022, 3:13 pm

>535 Petroglyph: You have demonstrated that Daniken's arguments leap to irrational rhetorical questions after presenting unresearched, uncited and general claims about a myriad of unrelated things. In contrast, my "possible" classifiers come after I have presented thoroughly researched, fully cited and very precisely focused on proving a specific point evidence. If I stated my conclusions without the "possible" classifiers, they would still be true, since I have presented enough evidence to show they are true. In contrast, Daniken leaps to all sorts of hypothetical possibilities that are unrelated to what he was saying before, and thus without his "possible..." conclusions, he would just have bits of unrelated babbling without the grand final claims that attract readers to his outrageously grand claims. If you do not see the difference between a scientist being careful to state an asteroid has a 98% "possibility" of striking Earth, and a hobbyist who claims all asteroids are "possibly" alien ships who are on investigative fly-bys of Earth... then, it is unlikely that you can separate truth from fiction.

546faktorovich
mayo 26, 2022, 3:24 pm

>538 Petroglyph: The quotes you cite have no linguistically significant elements in common. They also do not cite the same types of evidence, or use logic in the same manner. And if you quoted from the paragraphs around these conclusions from Daniken, the rest of the content would be similarly nonsensical, whereas if you quoted from the evidence where I prove my points, your rational readers would come to agree with my conclusions. In some of the quotes you cite here, I specifically warn readers that reading my conclusions without the evidence will make them appear difficult to believe because they contradict the accepted history of Britain. You are absurdly refusing to look at the evidence I recommend for readers to read first before judging these conclusions; and you are attempting to suggest the conclusions are unbelievable on their own. Yes, the conclusions are difficult to believe. This is why some scientists are hesitant to share their research with the general public, as groundbreaking sound appears unbelievable when it is compressed into a sound-bite or a quote, and the reader has to put in the work to understand it by reading over all of the evidence. If there are 17+ volumes in a series that proves that six ghostwriters wrote the British Renaissance; they are all necessary in their entirety to prove the case, so until you have read all of them, you are judging the conclusion without checking if the evidence fully proves it to be true.

547amanda4242
mayo 26, 2022, 3:32 pm

Faktorovich if you truly believe everyone here is in conspiracy against you and actively trying to harm you, then why are you still here? Why have you not left this site and reported it and all of us to a law enforcement agency?

548librorumamans
mayo 26, 2022, 3:42 pm

>547 amanda4242:

Indeed.

To me it is a telling indication of the value of this whole project that Ms Faktorovich is willing to engage extensively in a flame war in an obscure corner of a niche web site with a random bunch of yobs like us when she might be using that time in publishing and presenting papers to academic specialists in her area of study.

549Matke
Editado: mayo 26, 2022, 4:05 pm

>541 faktorovich: Your attempt at humor certainly fell flat.

Numerous other posters, who are not connected to each other, could open the files you call “corrupted”, and yet rather than entertain the idea that you, for the first time in your life, might be making some sort of mistake, you immediately leap to the conclusion that people are trying to maliciously harm your computer and thus harm you. That’s a very, very sad and scary mental world you live in.

There have been no cases of academic fraud here. And so of course you haven’t found any. You have used different methods to attempt to show that others are wrong. And that’s fine but it doesn’t prove fraud. Stating that someone (or several someone’s) have “hidden” or “manipulated” data doesn’t make it so. And all your wandering bloviation doesn’t prove it either, to anyone except yourself.

For heavens’ get down off that high horse of indignation at the very idea that people genuinely and honestly disagree with both your methodology and conclusions. That’s just part of life. People disagree. When you posit a theory so very far-fetched that it’s literally breathtaking, you must expect incredulity and attempts to prove that you are, in fact, wrong. That doesn’t mean that people are”committing academic fraud” or that they’re out to get you. That’s only in your very active imagination.

And I also ask why you haven’t turned us all in, especially since you’ve made so many, many accusations here, from plagiarism to fraud.

>548 librorumamans: Proud to be part of the yobs here.

550Keeline
mayo 26, 2022, 4:20 pm

>541 faktorovich:

all of you tempted to ban the First Amendment

You keep repeating "tempted" here when the conventional usage would be "attempted." The word "tempted" is related to "temptation" and has a different connotation and sense than "attempted" which is effectively "try to do." This reminds me of the confusion between "plagiarism" and "copyright infringement" that I raised previously and the exchanges about "peaked" vs. "piqued." Using the best word is important, especially since you are "translating" and "modernizing" older texts. Sometimes the dictionary definitions don't give a full sense of the usage of a word in English. When they have the same root they can seem similar but still not be the best to communicate your point.

This sets aside the notion that you are completely wrong about this. No one has encouraged LT to remove this two-part thread. At least if they did, it was not effective since it is still here and all of the participants seem to be available for comment.

Even though you were directing your allegations at me, I have no interest in seeing you silenced. Your ideas need to stand on their own merits and your ability to argue them. That can't be done if you are muted.

If you are using Windows, you absolutely should have some malware detection software. It isn't a perfect shield but it makes it harder. Most attacks are not targeted though. The bad guys try to infect as many computers as possible and usually want to keep their presence unknown so they can do what they want with your machine. A more immediate threat is when someone encrypts your hard drive and demands ransom, often in Bitcoin,

But I think it is fair that adults should be able to have an intellectual / academic discussion without resorting to emotions or accusations. If you could not open the files, you could ask for help or a fresh copy that might be more compatible with your system. The second zip file was intended to do this. Remember that you asked for the files in the first place since you did not trust my ability to count them or present honest information.

Several others could open the files, even on Windows. When the file was opened without an extension, the system asked which application to use. They chose Notepad but something else could be used, even Excel, which might have certain advantages for resorting the content.

The purpose of showing the top 7 words from 39 Tom Swift series was to illustrate that just because something can be counted does not mean it is necessarily relevant to an authorship attribution. I don't know if this is one or six of your tests. But if it is not relevant in this example, how much can it be relied upon for 284 texts that are about 400 years old with many editors affecting the text that is available now?

I know that you are sold on this kind of measurement and nothing I write will convince you. But others get to read this and they can decide which argument is more convincing.

The occasion of this count did cause me to digitize the missing text from the series, Tom Swift and His Ocean Airport (1934). I had the book but hadn't digitized it yet. So that was a worthwhile benefit of this two-part thread. Aside from the volumes I discovered on the family bookshelf as a kid, this was one of the first books I bought at the antiquarian bookstore I soon after managed for a dozen years (1988-2000). It was not one of the best stories though. It was hard to believe that it was the same ghostwriter of the largest group of books from this series, someone who did more than 315 books for the Syndicate plus many more, but the archival evidence indicated that my sense was wrong on this. The source of the outline was different since Edward Stratemeyer died in 1930 and his daughters continued the Syndicate business for more than 50 years compared to his 25 years. It will be several years more before I can share the file with anyone but I have it for analysis.

I also learned about tools that are in use and some of the LT members who use them. After having to write my own programs to implement one method 20 years ago, I was pleased that there are some easier tools out there. I've even installed some of them but have not had a chance to try to use them extensively.

James

551faktorovich
mayo 26, 2022, 8:53 pm

Update: My article, "Falsifications and Fabrications in the Standard Computational-Linguistics Authorial-Attribution Methods: A Comparison of the Methodology in 'Unmasking' with the 28 Tests", has been published in the Spring 2022: Volume 31 issue of the Journal of Information Ethics (pp. 56-75). https://en.wikipedia.org/wiki/Journal_of_Information_Ethics This article explores the errors and fraudulence in previous computational-linguistic studies, so you guys can now access this issue to read the details of my broader argument on this subject.

552faktorovich
mayo 26, 2022, 8:57 pm

>547 amanda4242: It seems you guys are the ones who believe there is a "conspiracy against" me and somebody is "actively trying to harm" me, as you keep repeating this point. I have been describing a conspiracy that happened in the British Renaissance. If I "believed" there is a proven conspiracy happening in this thread where all of you were "actively trying to harm" me, I would indeed have reported this matter to a law enforcement agency.

553faktorovich
mayo 26, 2022, 8:59 pm

>548 librorumamans: As I said, it is my "responsibility" to respond to all cases of academic fraud that are brought to my attention because they are advertised on a thread that is about my BRRAM research.

554faktorovich
mayo 26, 2022, 9:17 pm

>550 Keeline: I use "tempted" when I mean "tempted" and "attempted" when I mean to say "attempted". If you guys imagine that I use one when I try to say the other, but my own sentences do not support this hypothesis; then, you are just attempting to label as an error something you made up.

My Norton malware software detected a breach on the day when I mentioned I was concerned about viruses being sent, and thus would not open suspicious files. I would not have raised the issue if there was not a documented attack on my computer. It was not harmful since Norton blocked it, but it is irrational to dip into the fire to discover if it will eventually cause harm. I do not have anything of value on my hard drive, so a ransom-attack would not be logical.

Computers only ask for which application a user wants to use when they cannot find any application in a system that could open the file in question.

The top-6 words is 1 of my 27 tests. I do not recommend using this or any of the other tests in isolation or as a solo-test. My method specifically involves over 25 different tests to avoid glitches that come up when only 1 of these tests is used. Since you did not even apply the 1 test properly, and did not apply any of the other 26-tests, it was impossible for your results to be useful. For example, you failed to categorize the top-6 words into patterns to check which of your texts matched, or did not match each other. And once you had an answer from this test regarding there being multiple signatures in this corpus, you simply rejected it as untrue because only a single pseudonym was used for them. Such a rejection of multiple-hands participating in this texts is irrational, since this series is known to have been written by a group of different ghostwriters. Thus, even if the test worked perfectly to separate the different authorial-hands, you are basing your conclusion on if the results match what you imagine the attribution answer should be and not on if it arrived at an accurate attribution for the corpus.

555amanda4242
mayo 26, 2022, 9:40 pm

>552 faktorovich: It seems you guys are the ones who believe there is a "conspiracy against" me and somebody is "actively trying to harm" me, as you keep repeating this point.

You have repeatedly accused other posters of fraud, claimed we are trying to censor you, accused Keeline of trying to send you computer viruses, and quoted laws protecting whistle-blowers, but somehow we're the ones who think there's a conspiracy against you?? That's some...interesting mental gymnastics.

556Petroglyph
mayo 26, 2022, 10:05 pm

>544 faktorovich:

So you took a paraphrase as a literal, direct quote? When you say "as being frequently repeated in the exact source he cited", which exact source did I cite? Please point out to me where I cited that book you used.

I did not quote him exactly, so any complaints you have that are based on that mistake are null and void.

And>523 paradoxosalpha: knew what I was talking about.

little booklets

You speak Russian, right? At least, you claim to. If indeed you do speak that language, you should know that diminutives can mean many, many more things than just "little". They are used to indicate closeness, intimacy, familiarity, and a whole host of other attitudes. Sometimes, as when I refer to your volumes of bullshit as "booklets", a diminutive carries a dismissive attitude.

557Petroglyph
mayo 26, 2022, 11:50 pm

>545 faktorovich:

Ancient Aliens theorists, cryptozoologists, young earth creationists, moon-landing-deniers, flat-earthers, 9/11 truthers and antistratfordians all cite what they believe is "evidence" for their kooky poppycock (this list is not exhaustive). You are no different, Faktorovich. Your "evidence" is of the same quality: poorly researched, completely bonkers, untethered to reality, and with a heavy dose of a persecution complex and sprinkled liberally with anti-intellectualism.

Von Däniken's Chariots of the gods? is a book-length list of evidence that, so Von Däniken claims, is unexplainable by traditional archaeology, does not fit within the mainstream models, but is explainable by his pseudohistorical nonsense theory. He's published many books since, extending and refining his argumentation.

Lee Strobel's The case for Christ is a series of fourteen interviews with experts in some field related to biblical history, biblical archaeology, theology and apologetics. All the experts are "cross-examined" and are asked tough questions that doubters might have. The experts answer them. Strobel goes to great lengths to make his pseudohistorical religous poppycock sound factual, reasonable and true -- that is the explicit purpose.

Mark Sargent's series of Flat Earth clues (a classic!) goes through so many different lines of evidence: the drop in temperature and the appearance of icebergs and frozen wastelands as you get close to the edge, the increase in temperature and the appearance of magma as you dig down, the way that layovers on flights between two southern-hemisphere airports make more sense on a flat-earth model than on a globe model.

The "evidence" provided by these conspiracy theorists is generally only convincing to themselves, to other insiders, and to people who already accept many of the preconditions to becoming an insider.

Daniken leaps to all sorts of hypothetical possibilities that are unrelated to what he was saying before

When you understand why you reject Von Däniken's pseudohistory, you'll understand why I reject yours.

558Petroglyph
mayo 26, 2022, 11:52 pm

Whoever keeps flagging my review of Faktorovich's conspiracy theory booklet as "not a review", please know that it's a perfectly acceptable review here on LT, as per the site owner's say-so.

559amanda4242
Editado: mayo 27, 2022, 12:16 am

>558 Petroglyph: The flags may have nothing to do with this thread, and may just be some overzealous people who look at all recent reviews. If you post in the Flaggers! group they'll help with counterflags.

560paradoxosalpha
mayo 27, 2022, 12:17 am

Yes, since that review is now at the top of "Hot Reviews," it's likely to get some uninformed attention. I'm not surprised to see the flags from those not steeped in LT lore.

561librorumamans
mayo 27, 2022, 12:24 am

In this week's book section, Le Monde briefly discussed Affaires de style : du cas Molière à l'affaire Grégory, la stylométrie mène l'enquête by Florian Cafiero & Jean-Baptiste Camps.

Available in July (or perhaps already available – the date is ambiguous) from Le Robert.

The book seems relevant to this thread.

562Petroglyph
mayo 27, 2022, 2:18 am

>559 amanda4242:
>560 paradoxosalpha:

I see. I was too narrowly focused.

I'm not going to bother the Flaggers! group with this now. I'll let the wave of attention die down first.

563Petroglyph
mayo 27, 2022, 2:24 am

>561 librorumamans:

Thanks for the tip! Looks like an interesting popular-science book of the technology. Most cases I've already read about, but I'd want to read that Molière chapter -- I don't even know whether there is a Molière Authorship Question.

Unfortunately: Il vous reste 78.06% de cet article à lire. La suite est réservée aux abonnés. (Though I guess most of that is the reviews after this one.)

I'll have to see if ILL will get this for me after the summer.

564Dilara86
mayo 27, 2022, 2:40 am

>563 Petroglyph: I don't even know whether there is a Molière Authorship Question.
Oh yes there is. With some of the same arguments used for Shakespeare - the belief that an actor is too stupid to write their own material. Corneille must have been Molière's ghostwriter. If you can read French, this France Culture article is short and informative.

565librorumamans
mayo 27, 2022, 9:00 am

>563 Petroglyph:

Yeah, that's why I linked to the publisher rather than the newspaper article. By the way, the book will be released in July; I resolved the date issue.

566lilithcat
Editado: mayo 27, 2022, 9:50 am

>564 Dilara86:

the belief that an actor is too stupid to write their own material

That attitude drives me nuts. Only people who know nothing of theatre would think that. Like people who believe in a "modern play production process where a troupe spends a year or more repeating perhaps the same re-running play, or at least gets to rehearse for months before the first staging".

I'm involved with a theatre company and many of our ensemble members write their own material - very successfully, too.

567librorumamans
Editado: mayo 27, 2022, 1:28 pm

>564 Dilara86:

Thanks for that link.

More than fifty years ago, one of my lit. profs was doing computer-assisted stylistic analysis (his period was Elizabethan and Jacobean) on a mainframe that was probably not far removed from drum memory. As I recall, his team were tagging text grammatically using thirty-five or so parts of speech and finding quite distinct patterns for each author, including WS. I'm interested to see how this field has developed since. His approach, even then, seemed much more sophisticated than analysing word frequencies.

568faktorovich
mayo 27, 2022, 1:29 pm

>555 amanda4242: All of the things you are mentioning can be true in isolation without them being tied together as a unified conspiracy. Unless you are saying you know it is a conspiracy from the perspective of the side taking these actions?

569faktorovich
mayo 27, 2022, 1:36 pm

>556 Petroglyph: Since you are asking me to quote back to you what you said, sure, the source you named was: "Von Däniken's... Ancient Aliens". When I searched for this author on this topic, the source I used to check for this phrase came up. You did quote the "exact" phrase "it's possible". If you cannot admit what you have stated on-record, it is indeed impossible to have a rational, logical debate with you. The term "diminutive" means pretty much the same thing in Russian and English grammar; so it is unclear why you are suggesting there is a special comprehension of this term that would be innate in a Russian speaker. The term "booklet" is a technical term that signifies a specific page-length, and not a term that is used to indicate "intimacy" or "familiarity" with "booklet" in question.

570faktorovich
mayo 27, 2022, 1:49 pm

>557 Petroglyph: You guys are attempting to minimize my findings by equating them with nonsensical theories, but it is actually previous authorial-attribution methods (including those that use Stylo) that have been comparably nonsensical and anti-scientific. My methodology is purely scientific and I have proven it with extensive concrete evidence. If you have to digress from actually talking about my findings and into science fiction, this is an indication that you have no basis to discredit my findings on the facts, and so instead you are relying on readers believing you false assertions about it. There is no documented proof of any alien visiting earth, whereas there is overwhelming proof the six ghostwriters I point to lived, wrote, published and had a major impact on British literature even only under their own bylines. Conspiracy theories about far-fetched things like aliens and flat-earth are very likely to be deliberately designed to make all those who uncover massive corruption, fraud and other crimes appear as unbelievable to a public who comes to expect nonsense after reading these outlandish theories. However, I believe that there are members of the public who can tell truth from fiction and nonsense, and my BRRAM series is designed for those who brave facts and the inconvenient for the establishment truth about the overwhelming influence of ghostwriting, plagiarism, and other authorial-credit problems that are rewarding those who purchase bylines to look smart, over those who are smart but lack the means to purchase recognition.

571faktorovich
mayo 27, 2022, 1:59 pm

>566 lilithcat: In my last set of reviews for PLJ, I reviewed a book about the Kennedy assassination. The author (an intelligence specialist involved in the case) explains that Castro had (during a speech) called Kennedy a spoiled, rich incompetent brat who only got ahead by purchasing power (i.e. via ghostwriters), and in response Kennedy decided to assassinate Castro with help from a close friend of Castro who Kennedy (via connections in the CIA) asked to assassinate Castro; Castro's friend seemed to agree with this plot, and shortly afterwards Kennedy was assassinated. There are many actors who took bad actions in this plot, but an actor or a politician who is so concerned about covering up their own incompetence they are willing to kill those who accuse them of using ghostwriters is the worst imaginable villain.

572lilithcat
mayo 27, 2022, 2:02 pm

>571 faktorovich:

Now there's a non sequitur if I ever saw one.

573faktorovich
mayo 27, 2022, 2:04 pm

>567 librorumamans: A scientific method does not need to "seem" "sophisticating", it has to use the exact steps that are necessary to arrive at the accurate answer; and then, scientists must explain the method in a simple manner to allow all readers to replicate it to audit if the method works and if the results can be duplicated by another user. In contrast, a method that "seems" "sophisticating" because the author has used big words and has not shared the data or provided any way for users to replicate to check the results; then, this method should be automatically disqualified, instead of being celebrated because it sounds scientific because nobody can understand what it is trying to say.

574faktorovich
mayo 27, 2022, 2:06 pm

>572 lilithcat: You probably should read the full review, which is available for free: https://anaphoraliterary.com/journals/plj/plj-excerpts/book-reviews-spring-2022/

James H. Johnston, Murder, Inc.: The CIA Under John F. Kennedy (Lincoln: University of Nebraska Press, 2019).

575lilithcat
mayo 27, 2022, 2:08 pm

>574 faktorovich:

And I think you should re-read >566 lilithcat:, because your "response" is completely irrelevant to what I wrote.

576Petroglyph
mayo 27, 2022, 5:44 pm

>569 faktorovich:

Ancient Aliens is the name of that entire conspiracy theory -- where did I say it was a book? You misinterpreted that as a book title, found a book that I did not cite, decided to take a paraphrase literally and pretend that this means you've "won".

Whatever, Faktorovich. "I am not going to engage in repetitive insult-comedy." >523 paradoxosalpha: knew exactly what I meant, and their opinion means much more to me.

The term "diminutive" means pretty much the same thing in Russian and English grammar

Except that Russian uses it a lot more, has many more ways of forming one, and does so productively. For common nouns and proper nouns.

English doesn't really offer the morphological toolkit to do stuff like this out-of-the-box.

577Petroglyph
mayo 27, 2022, 5:47 pm

>572 lilithcat:

I was recently watching this documentary... I read this article the other day...

The associative reasoning of the conspiracy theorist -- where shenanigans anywhere make the suspected shenanigans in this particular case just feel so much more likely.

578Petroglyph
mayo 27, 2022, 6:57 pm

>570 faktorovich:

I have proven it with extensive concrete evidence {...} There is no documented proof of any alien visiting earth, whereas there is overwhelming proof the six ghostwriters I point to lived, wrote, published and had a major impact on British literature even only under their own bylines

But Faktorovich, the Nazca lines are real! Those Iraqi batteries are real! The Val Camonica petroglyphs depicting astronauts are absolutely real! Are you implying that the Antikythera mechanism is a fake? That's all "extensive concrete evidence" as well! The vast, huge, mind-boggling size of the cosmos, the ~100 billion stars just in our galaxy, and at least that number of planets just in our galaxy make it very very unlikely that the only intelligent life in existence is on this one single planet that we happen to live on. Are you disagreeing with any of this?

Both you and Von Däniken have massive amounts of evidence -- many books's worth. Both are laughably wrong, for exactly the same reasons. Your nonsense is also far-fetched and anti-scientific. Both Von Däniken and you clothe yourself in some semblance of "science" to sound convincing -- you tout your citations and the books you self-publish, for instance, and your PhD, and your so-called data.

Conspiracy theories about far-fetched things like aliens and flat-earth are very likely to be deliberately designed to make all those who uncover massive corruption, fraud and other crimes appear as unbelievable to a public who comes to expect nonsense after reading these outlandish theories.

Because this makes you not sound like a conspiracy theorist at all. Are you a victim of COINTELPRO? Are the people here all shills on the payroll of Operation Mockingbird?

However, I believe that there are members of the public who can tell truth from fiction and nonsense, and my BRRAM series is designed for those who brave facts and the inconvenient for the establishment truth about the overwhelming influence of ghostwriting, plagiarism, and other authorial-credit problems

Meanwhile, Ancient Aliens theorists say things like this: "But for some who are brave enough to question the status quo, a great mystery has yet to be solved, one that could in the future change the way humanity understands its place on earth and in the cosmos at large..."

Your booklets have nothing on the implications of Ancient Aliens!

579Petroglyph
mayo 27, 2022, 6:58 pm

>564 Dilara86:

Interesting article, thanks! I want to read up on this now!

580faktorovich
mayo 27, 2022, 8:30 pm

>576 Petroglyph: If English "doesn't really offer" a given grammatical application; then, you should not attempt to use it in English, but rather only use it in the languages where it has the desired meaning.

581faktorovich
mayo 27, 2022, 8:32 pm

>577 Petroglyph: And perhaps I am always watching new documentaries and reading new articles and books, and that's why I am citing the sources I read most recently. You would notice this is the rational explanation if you looked over the hundreds (approaching a thousand) book reviews I have previously done that are available for free on Anaphora's website.

582faktorovich
mayo 27, 2022, 8:34 pm

>578 Petroglyph: You are just yelling nonsense that is unrelated to my research. So, I will just repeat: my research is purely scientific, and accurate.

583susanbooks
mayo 28, 2022, 9:52 am

>582 faktorovich: "my research is purely scientific, and accurate."

Well, given the integrity, consistency, and quality of your posts, I'd say case closed!

584faktorovich
mayo 28, 2022, 8:29 pm

>583 susanbooks: Yes, it is indeed a closed case, as my integrity, consistency and quality is beyond reproach of those who also follow these principles.

5852wonderY
mayo 29, 2022, 6:48 am

Then we’re done here.

586faktorovich
mayo 29, 2022, 9:26 am

>585 2wonderY: I don't know what you guys have been attempting to do here, or why you might now be done "here". I continue to be here to answer questions about my research, if any more questions happen to come up.

587paradoxosalpha
Editado: Jun 3, 2022, 6:35 pm

Analysis of an ignorant taproot of anti-Stratfordiansm:
https://antigonejournal.com/2022/05/shakespeare-greek-latin/
“small Latin and less Greek”

588faktorovich
Jun 3, 2022, 9:10 pm

>587 paradoxosalpha: Tom Moran performs a curious analysis, but it would all make a lot more sense if he had read BRRAM. The lines that identify this section as a typical example of Jonson's satirical "rivalry" with "Shakespeare", under whose byline he also wrote; thus, it would be Jonson's rivalry with himself unless he specified he was satirizing his "Shakespeare" writing-partner, Percy; he does so in this passage by pointing to "Shakespeare" being better than "Lyly... Kyd... or Marlowe"; my 27-tests indicated that all tested "Lyly" and "Kyd"-bylined and all but one of the "Marlowe"-bylined texts were ghostwritten by Percy. I spend a significant section of my chapters about Jonson on explaining the self-puffery of himself and well as of his "Shakespeare"-bylined plays across Jonson's several satirical notes about "Shakespeare". And the joke about Latin and Greek is still missed in Moran's explanation. In most of Percy's self-attributed plays, he includes chunks of Latin text that he is quoting from classical sources without translating these into English; in most of "Shakespeare" plays such sections (as Moran explains) are translated into English, so the original sources are not as obvious and appear to be part of the tapestry of the dramas. So in this remark Jonson is possibly confessing that he helped Percy by translating some of these quotes into English from Latin/Greek, whereas when Percy wrote more independently in the last decade of his life, he just left these in their original languages. The latter part of Jonson's passage that Moran starts with refers to Percy's passion for borrowing content from Greco-Roman classics, such as specifically "Euripides, and Sophocles" in his Latin fragments. If "Shakespeare"-Percy, did or did not know Latin/Greek is relevant to proving how his style differed from Jonson's. "Shake-spear" was not a real person; it is a pseudonym; so it is absurd to argue about if this fictional character knew or did not know Greek or Latin; or to argue that his lack of Greek or Latin would make this fictional character less likely to have actually been an author.

589susanbooks
Jun 4, 2022, 3:30 pm

>588 faktorovich: "it would all make a lot more sense if he had read BRRAM"

Oh my, it just doesn't end.

590Matke
Editado: Jun 4, 2022, 7:39 pm

>588 faktorovich:
I don’t think I’ve ever read such a convoluted explanation of a simple sentence.

You’ve got to be joking.

591faktorovich
Jun 4, 2022, 8:38 pm

>589 susanbooks: What doesn't end? I have made an accurate re-attribution of the British Renaissance, and I am answering any questions that are posted in this forum about my conclusions. Please do not tell folks to stop posting questions because you want whatever it is you are speaking about to "end". If folks have questions (whatever they might be) I am going to address them, and I welcome all questioners.

592faktorovich
Jun 4, 2022, 8:42 pm

>590 Matke: It is far from a "simple sentence" because it linked to article with a very complex but lacking alternative explanation of the Jonson-"Shakespeare" rivalry. I responded to this linked article and not merely to the sentence posted in this forum. My BRRAM series is not a "simple" re-attribution of the Renaissance, but rather one that is very complex as it pulls together pieces of evidence that jointly confirm my linguistic conclusions. So, if you closely read my explanation and the relevant parts of the linked-to article, you should arrive at an understanding of this particular point that I explain in more depth and with more complexity in BRRAM.

593Matke
Editado: Jun 5, 2022, 7:03 am

I don’t see questions in >588 faktorovich: , >589 susanbooks: , or >590 Matke:.

But you do you, Doctor F.

I have a question now, however. In >591 faktorovich: , replying to susanbooks, you say (exact quote here)

Please do not tell folks to stop posting questions because you want whatever it is you are speaking about to end.

My question is this: do you have any evidence at all that susanbooks has told, or asked, others to stop posting here? Or is it your general feeling that this has been done in some Library Thing conspiracy to silence you?

In either case, there are numerous groups here, any one (or more) of which you can join and start your very own thread. That way you can have many virtual avenues to disseminate your research and theories. That would be better than one thread which takes time to load. You don’t have to jump through any technical hoops; simply look through groups, find one you think is suitable, click in the upper right corner on “Join Group” and then start your thread.

Just a helpful suggestion.

594Matke
Jun 5, 2022, 7:01 am

Re:>593 Matke:
You can find groups by searching a book title or author. On the left sidebar, you’ll see “Groups” and something like “Discussions.” Clicking on either of those will bring up multiple choices. Just click on through.

Happy posting.

595faktorovich
Jun 5, 2022, 10:19 am

>593 Matke: How else do you imagine a statement like "Oh my, it just doesn't end." can be interpreted if not as a request for others to stop asking me questions? And why would anybody who is simply tired of a long discussion posting outcries regarding its length if not to indicate to others to stop posting? If somebody is simply tired by the length of a discussion, the logical solution would be to just not post yet another response. A discussion is a place where there are more than a single voice, so starting a thread where I would be the only voice, or a threat of my "very own" would be counter-productive to my efforts to discuss my research with the public.

596faktorovich
Jun 5, 2022, 10:23 am

>594 Matke: I am currently only working on completing my British Renaissance Re-attribution series; I do not have time or interest in any other discussions. BRRAM changes the history of the world, so it needs all of my attention to be polished into the most convincing and through group of books possible.

5972wonderY
Jun 5, 2022, 11:01 am

**eye roll**

598Matke
Editado: Jun 5, 2022, 1:28 pm

>596 faktorovich: I think I didn’t make myself clear.

When you start a thread, you can title it something very similar to this. You can leave a link here to the new thread. Many, perhaps most, threads here are in fact conversations between and among readers with similar interests.

People start threads all the time for very specific, limited subjects. (Example: there’s a new thread that just started for a group read of Miss Mackenzie Several people have joined, and I’m sure that more will come on board.And the only thing discussed will be that book, and social and historical issues which pertain to that book.) So you could do the same thing with a catchy title like “Changing the literary history of the world, or something else that suits your needs. I can almost guarantee that it would attract numerous people who want to discuss that subject.

Surely if you have the time to post lengthy and thorough responses here, you can spare the very few minutes (truly, very few) it would take you to set up a thread to welcome any and all who would be interested in such a topic.

Your choice, of course. I’m just suggesting a way to easily spread the word about your work.

599lilithcat
Editado: Jun 5, 2022, 5:45 pm

>593 Matke:

Honestly, I'm not sure why you are suggesting that faktorovich start a new thread in which to discuss her theory. That really is what this thread is for.

See >1 amanda4242:: "Come read her argument, and tell us what you think!"

600susanbooks
Jun 5, 2022, 3:13 pm

>596 faktorovich: "BRRAM changes the history of the world"

reposted without comment

601Matke
Jun 5, 2022, 6:14 pm

>599 lilithcat:
She expressed concern that people here were being discouraged from posting. I simply offered an alternative, where she could perhaps get more exposure for her ideas.

And this thread is beginning to load v quite slowly.

602lilithcat
Jun 5, 2022, 8:08 pm

>601 Matke:

And this thread is beginning to load v quite slowly.

This thread can be continued. (Indeed, it is a continuation of an earlier thread.)

603Matke
Editado: Jun 5, 2022, 8:33 pm

>602 lilithcat:
I know. Again, I was just suggesting another idea.

ETA
It honestly never entered my mind that anyone would read that suggestion in a different way.

I didn’t mean that she should leave this thread…that would be silly, since the two threads are about her work. I just thought an additional thread might be useful for her.

604faktorovich
Jun 5, 2022, 8:44 pm

>598 Matke: I am here in this discussion because I did an interview with LibraryThing, and the interviewee created this thread to welcome people to comment on my research (without telling me she would do so), as you can see in the first posting in this thread. I do not chat in any other forums on LibraryThing or elsewhere online because I need all of my time to create the 24 or so volumes that will be included in BRRAM when it is finished. I am not going to create a separate chat, if so many people have already comments on BRRAM in this thread and since this thread is already addressing the exact "topic" that I am researching. So, I repeat that I am here (and only here) to answer any questions anybody might have about BRRAM.

605Matke
Jun 5, 2022, 10:00 pm

>604 faktorovich:
Thank you for your response.

606faktorovich
Jun 6, 2022, 8:40 am

>605 Matke: You are welcome. And all questions are welcomed.

607Crypto-Willobie
Jul 11, 2022, 7:41 pm

607

608ljbryant
Jul 11, 2022, 8:14 pm

> 607 Gah! You've resurrected this conversation!

That is not dead which can eternal lie, and with strange aeons even death may die.

609faktorovich
Jul 12, 2022, 10:33 am

>608 ljbryant: You all remain welcome to resurrect this discussion by asking further questions. I just returned from presenting my BRRAM research at the Imaginarium conference in KY, where it stirred up some scholarly agitation as well. The second half of BRRAM is around half-done, as I will probably finish it in January 2023, and this second half provides even more significant confirming proof of my linguistic findings. I should finish translating "Restitution" in the coming week or so, before starting the translation of Harvey's composition textbooks from the years when he first became a Cambridge Rhetoric professor. There is plenty of new things to discuss as I carry this project to completion.

6102wonderY
Jul 12, 2022, 2:28 pm

Really?

https://www.entertheimaginarium.com/wp-content/uploads/2022/07/Imaginarium2022_P...

Where is that presentation in the program? Who are some of your academic peers there who might be knowledgeable about the subject?

611Keeline
Jul 12, 2022, 5:16 pm

>610 2wonderY:

See pages 5, 10, 11, 12

James

6122wonderY
Jul 12, 2022, 5:34 pm

>611 Keeline: Ah! I missed seeing the workshop on page 11.

613Keeline
Editado: Jul 12, 2022, 7:14 pm

>612 2wonderY:

The description in the program book that you linked to shows that this is a fan / writer /artist convention, not an academic conference.

It also has "no harassment" policy which probably means that there is not any serious objection from the audience.

The Imaginarium Convention has always strived to maintain a welcoming and inclusive environment, and we have a no-harrassment policy in place to back that up, to make sure our atmosphere is one that you thrive in!

I'm sure it is a fun event but not really a meeting of a meeting of "scholars."

Even in a 90-minute presentation, there would not be much opportunity for someone who is new to the field to take it all in and see where there could be areas of concern. Someone who has already done some attribution studies might have some opinions but without time behind a computer with some texts of known and unknown authorship, it is all hypothetical.

James

614faktorovich
Jul 12, 2022, 9:13 pm

>613 Keeline: If any of you have a specific conference in mind where I must present to give access to my research to the top scholars in attribution studies, give me the name of this conference, and I'll look into applying to present. Imaginarium contacted me after I sent a general pitch to festivals/ conferences with an offer for me to be a Featured Guest and thus to sponsor my hotel, one dinner and a free exhibit table for my books. Since I am currently semi-retired as an independent researcher, these add up to around $700 savings, over paying for the hotel and also registration etc. at a conference like MLA (where I presented in at least three conferences) where a table would have cost around an extra $3,000; as is the entire Imaginarium project cost me $250 (and I would have spent $100 for that week if I had stayed home). Additionally, aside for Volumes 1-2, the translations in Volumes 3-14 are all of Percy's plays and poetry; this led to Wikipedia's editors to be confused that my research re-attributes all of "Shakespeare" only to Percy (missing the other 4 ghostwriters to whom I attribute at least some of the "Shakespeare" texts); and in general, I understand how these satirical fictions can be misinterpreted as my own creative output of a reviewer does not read closely enough (or only glances at the cover) that these are first-ever translations from Early Modern English. The second half of the series, which I should finish in around 6 months, will include heavier non-fiction translations that should make confusion about BRAMM's nature less likely. If you guys specify where I should pitch to present this research at that point, I would be delighted to do so. There was a lot of interest in "Shakespeare", the Renaissance, ghostwriters, and whodunnit at the Imaginarium, so among those who visited my table there was an audience who was very interested in my findings.

There were no objections to my BRRAM research either during my 90-minute Workshop or when I explained my findings at my table. There were objections from the audience and fellow penalists when I instead discussed the corruption of the modern publishing industry in the two panels where I was asked to speak about publishing and editing; I did not discuss my attribution research at these, but rather my research into current problems of purchased/ bot reviews, and unfair and secretive selection practices. Having done a lot of presentations, I think it is a much more useful panel for the audience when controversial topics are raised openly, as opposed to sitting through a panel where the same material is recycled (as those who have set through such editing/ publishing panels might have noticed). Given that the drive there and back took 30 hours, and I was exhibiting or presenting across the entire 3 days 10am-6pm or so, and I spent a day explaining BRRAM to professors/ media/ librarians, it was 6 days of full-time labor, and not a "fun event". I explained my research at the exhibit first, so there were many specific in-depth questions about my findings at the Workshop. There were also statisticians, British literature professors and other specialists who visited my table who also had many insightful questions, or more so than any questions I have received before during my previous presentations/ exhibits. You guys just keep striving to find something negative in any positive that I mention in this discussion. Isn't this exhausting? Aren't you tempted to focus on progressing forward, instead of ridiculing concluded events?

615Keeline
Jul 13, 2022, 1:42 am

>614 faktorovich:

Based on past reading of articles related to stylometric analysis that I have done over the past 20+ years, it seems that the leading journal in the field is Computational Linguistics

https://cljournal.org/

https://en.wikipedia.org/wiki/Computational_Linguistics_(journal)

Computational Linguistics is a quarterly peer-reviewed open-access academic journal in the field of computational linguistics. It is published by MIT Press for the Association for Computational Linguistics (ACL). The journal includes articles, squibs and book reviews. It was established as the American Journal of Computational Linguistics in 1974 by David Hays and was originally published only on microfiche until 1978. George Heidorn transformed it into a print journal in 1980, with quarterly publication. In 1984 the journal obtained its current title. It has been open-access since 2009.

As noted there, it is related to the Association for Computational Linguistics and they have several conferences.

https://www.aclweb.org/portal/

This is where you would find people who work in the field who would be prepared to assess and discuss the details of your method far beyond any of us could, though we might try. You don't seem to value our observations but perhaps people who have worked in the field extensively for many years would be of greater value.

There is a massive difference between an academic conference and a fan convention. The audience who is attracted will not be specialized in the field at a convention. They can be fun and worth attending but they are not a substitute for an academic conference. I have presented at both types of events for about 30 years though not on the topics of computational linguistics of literary computing.

Occasionally I deal with authorship attribution but it is from the perspective of documentary evidence that is contemporary to the texts involved. The one time I did do something with the QSUM method, it was an effort to see if it could say anything interesting about the texts with which I was familiar with the extrinsic (documentary evidence) for the authorship.

James

616faktorovich
Jul 13, 2022, 9:18 am

>615 Keeline: Dear James: I don't know how long it has been since you checked the website for the Association. But their conferences are not a viable option because none of them have a CFP for any 2022 or 2023 conference meeting. Most of their conferences were last held over a decade ago. The main conference was last held in Thailand in 2021, and there do not appear to be any plans for a new meeting this August. Their European branch meets every 3-4 years. And the North American branch last met in 2019. Do you have another conference in mind that is likely to actually happen in the coming year or two? And I definitely would not manage to travel to Thailand on $250, so events outside of North America do not fit with my minimalist budget preference. I have presented my research at 18 different scholarly and creative conferences before; most were scholarly, but occasionally I receive travel funding to present at mostly creative conferences (which all also have some scholarly panels and presentations); whereas scholarly conferences are typically covered in cost with a grant by the college where I am teaching (which is not an option now that I'm independent). As I have explained across this thread, my re-attributions of the Renaissance are exactly correct. I explain how previous researchers in this field have been incorrect in their findings across BRRAM. And the point of going to a computational linguistics specializing conference would be to explain to these other researchers in person that their findings have been wrong and to explain how my method and findings are accurate, since they are supported by handwriting and financial evidence I present across the series. With this in mind, I welcome any ideas on where I should go to deliver this message to the insiders in this field.

617Keeline
Jul 14, 2022, 2:59 pm

>616 faktorovich:

Some groups are bad at updating their website if they don't do the work themselves. Hiring someone to provide new content becomes a burden.

Just as likely these days is that they have been slow to start up in-person conferences with the changing requirements. The Popular Culture Association, where I have been presenting regularly since 1992, cancelled 2020, delayed 2021 to a later month and held it virtually, and held 2022 in the usual month virtually.

A fan mystery convention called Bouchercon is trying to do its first in-person event in Minneapolis later this year. It will be in San Diego next year. Of course some events which have met in person have found that some people tested positive after coming home. Probably they are not gravely ill as a result but when there is a positive, the contact tracing spreads the word. This makes organizers more timid to avoid their event becoming a news item.

So not seeing a 2022 or 2023 conference doesn't surprise me very much in these uncertain times.

Going to a conference is always an expensive venture. I've always had to pay my way to PCA and other events I go to. No one pays me, comps rooms, or contributes to travel expenses. This is true even when I was area chair for our section and had to do a lot more work before and during the event beyond merely attending or making a presentation. It is simply the way of such things. For those who earn their living in academia, of which I am not, there's a certain expectation of publishing and conference presentation if one wants to maintain or advance their position at an institution. Thus the economics are set up to meet that "publish or perish" environment. A fan convention might try to get participation from authors whose attendance might attract regular attendees to the event. In those cases they may provide an honorarium or help with travel expenses. It depends on the event. But it is not typical unless one is a big name that will draw participants.

A far lower cost than a conference is to submit a paper to a peer-review journal such as the one I linked to in the first place. Under most handling of this, the submitted article (without the author's name on it) is reviewed by one or several readers for a critical reading. When possible, these readers (usually volunteers) are people who work in the field and can provide an experienced evaluation of the article that helps the editor to decide if it should be published as is, with edits, or not at all. This is the ideal and it doesn't always work and sometimes the easiest path is to decline an article that doesn't fit easily.

I suppose that you've encountered this journal before as you look at other writings in the field of stylometric analysis. Thus I don't expect that it is new to you. But you asked for publications and conferences with people working in the field so I mentioned it again at the risk of telling you of something that is already familiar.

James

618coprime
Jul 14, 2022, 4:21 pm

If you Google "north america computational linguistics conference" the first result is the webpage for their 2022 conference, which is being held in Seattle and online this week.

619faktorovich
Jul 14, 2022, 8:55 pm

>617 Keeline: I began meeting people in Louisville on the 7th, and I still haven't seen any symptoms of Covid; it takes 2-14 days for it to show up, so I guess I won't know with certainty if I caught it until a week from now. In my experience, across all of the conferences I have done (with exhibits/ lectures etc.) I have never caught the flu/ cold, whereas I catch it every time I start teaching college after a break, and tend to catch it if I go on a cross-ocean plane trip. I think this means folks with 3 shots like me are probably safe to resume conferencing unless they party pretty hard beyond the standard meetings.

A virtual convention should be a lot cheaper, so that sounds good, if those are still going to be happening in the future years. I asked only about conferences, as I have not done an in-depth search for these assuming the few I have heard about are the major ones (I think I have previously looked into the Association you featured and their conferences). As I have mentioned in this thread, I submitted my research to pretty much every somewhat relevant journal in the fields of computational-linguistics, Renaissance, drama, literature, etc. This took less time than it otherwise might have because I had around 35 different scholarly articles between my Renaissance and 18th century books, so covering hundreds of journals (each holding them exclusively for 1-6 months or so) took a couple of years. I have not submitted any of my ongoing research since I decided to self-publish BRRAM with Anaphora over a year ago. My submissions did lead to a few publications in Critical Survey, Journal of Information Ethics etc., but I mostly hit a wall in a review process where reviewers are entirely unbound by any ethical rules and give irrational or erroneous reasons for rejections (if any). I will probably try one more round of submissions to academic book publishers for the second half of the series after I finish "Restitution" in the coming couple of weeks.

620faktorovich
Jul 14, 2022, 8:57 pm

>618 coprime: Thanks for finding it. I can't imagine why it's not included on their list of conferences. Either way, I think today was the last day of meetings for the event. And did you find a place where they advertise a CFP for next year?

621coprime
Jul 15, 2022, 1:41 pm

>620 faktorovich: No, my two seconds of Googling did not pull anything up about the 2023 conference, and I don't care to put more effort than that into searching. The third result if you Google "north america computational linguistics conference" is a call for papers for the 2022 conference, which was put out in October 2021. So my guess would be that more info about the 2023 conference will be available by simple Googling in a few months, if the 2023 conference follows a similar timeline as the 2022 conference.

622faktorovich
Jul 15, 2022, 8:50 pm

>621 coprime: I think the real answer is that I am doing all that is physically possible to get my research out there, under the restraint of my need to keep the cost as close as possible to zero. Traveling abroad to this particular potential 2023 conference (assuming it will be in person and 9/10 likely abroad) is not a rational option. I merely asked for suggestions regarding which conferences I should present in because you guys objected that the conference I went with was "creatively"-minded. Since you guys cannot come up with a better non-"creative" option, that should answer your concerns regarding why I could not find this better option either. It would be even better if I was wrong, and there was a better option I had not considered, but being right works as well (as I am happy to stay home to focus on the research vs. spending more time selling it to the public).

623Keeline
Jul 16, 2022, 12:59 am

>622 faktorovich:

you guys objected that the conference I went with was "creatively"-minded

No. Observations were made that an academic "conference" is very different from a fan "convention". It has nothing to do with whether it is "creative." The description overall and the items on the program show that it is not a gathering of academics but rather people who aspire to be artists and writers. If there were 1,000 participants, how many of these have engaged in computer-based stylometric studies?

I have to second coprime's comment about the timing. The PCA conference with which I am most familiar meets around Easter (typically April) and has its call for papers at the end of summer and ending around October. Acceptance usually is determined by December. So a July conference having a deadline of October but not as early as 12 months out is an expected pattern.

You have to be the keeper of your own finances but don't expect free publicity for your research. You have to put in not only time but resources. I still think that submitting an article is probably the way to go if you can't afford to travel without subsidy.

James

624faktorovich
Jul 16, 2022, 12:43 pm

>623 Keeline: Out of "1,000 participants", I was among the top-6 guests of honor/ hosts who received some funding, whereas everybody else sponsored their own way. And many others were probably rejected when they applied to present. The other 5 featured guests bestselling/ award-winning creators. If you think me presenting as one of these top-6 at an event is insufficient in terms of doing the most with free publicity, you have a very ambitious perspective on academia/ publishing. At least three participants came up to my booth and asked me complex questions about computational-linguistics, saying they do research in related fields. And others were teachers/ professors/ librarians who had researched the Renaissance etc. The point of a conference is to share ideas with others in a field, when they might not otherwise stumble on findings in the jungle of the web. The types of panels Imaginarium had is similar to those at the Virginia Festival of the Book or the Tucson Festival of Books, where I also presented as a featured speaker. I scheduled to check for the CFP at the start of September, since you guys seem to believe this is the only conference that would confirm the worthiness of my research. I am not conserving my resources on marketing because I cannot afford a trip to Thailand etc., but rather because any money I spend means I might have to go back to "work" teaching college or the like sooner, whereas I currently have enough savings to work only on my research for at least a couple more years. Being insensitive towards somebody else's minimalist savings ethics is not a polite approach. And I might take out an ad or the like to advertise the series after I finish its second half in half-a-year. It would cost about as much to travel to present in Thailand, as purchasing a full-page ad in New York Times for a few weeks.

625coprime
Jul 16, 2022, 2:16 pm

>624 faktorovich:

you guys seem to believe this is the only conference that would confirm the worthiness of my research

I don't really care where you do or don't present your research. I was just pushing back against your assumptions in >616 faktorovich:

But their conferences are not a viable option because none of them have a CFP for any 2022 or 2023 conference meeting. Most of their conferences were last held over a decade ago. The main conference was last held in Thailand in 2021, and there do not appear to be any plans for a new meeting this August. Their European branch meets every 3-4 years. And the North American branch last met in 2019.

with info that I was able to find with a very basic Google search. You can do with that information what you want.

If you want other conferences, I found this list of computational linguistics conferences for 2022 & 2023 with five seconds of Googling. (When I searched "north america computational linguistics conference 2023" it was the third result.) Many of the conferences are international, but they may have an online portion. There are also a handful of conferences listed as taking place in Los Angeles, San Francisco, and New York.

Again, you can do with that information what you want; I have no investment in whichever conferences or conventions you do or don't attend.

626faktorovich
Jul 16, 2022, 9:06 pm

>625 coprime: The brevity of your search is one clue that your results are not likely to be useful. This list of seemingly different conferences links to formulaic pages that only change the locations of the events, without making any edits to the general call-for-papers; they do not even change the due-date for submission. It is very likely that this conference is similar to pay-to-play journals that just make a profit from registration without investing in filtering or organizing content/ the event. It is irrelevant if you care about my access to conferences or not; what is relevant is if you guys are again posting advertisements for events/ journals/ resources that are potentially fraudulent, or un-useful; as usual I have to point out any potential bad actors so that those reading this thread will not follow any bad advice.

627coprime
Jul 16, 2022, 10:41 pm

>626 faktorovich:

The brevity of your search is one clue that your results are not likely to be useful.

¯\_(ツ)_/¯

628faktorovich
Jul 17, 2022, 11:15 am

>627 coprime: To repeat again, I am here to answer questions about my research. I do not need any help with performing the research or selling it. If you have questions please post them, and I am delighted to answer them.

629faktorovich
Jul 20, 2022, 4:20 pm

My BRRAM article on theatrical fraud in the British Renaissance, “Manipulation of Theatrical Audience Size” has been published in Critical Survey: https://www.berghahnjournals.com/view/journals/critical-survey/34/3/cs340306.xml This covers objections regarding why my research is not available in more mainstream scholarly journals. You guys can now access this article, and criticize or benefit from it. If anybody wants a free review copy of this article, you can email me at director@anaphoraliterary.com (just as with the rest of BRRAM) and I'll email a pdf to you for free.

630faktorovich
Oct 3, 2022, 9:03 pm

I expanded my computational linguistic study by adding 18 additional texts to increase the total of texts in the corpus from 284 to 302. All 18 of these texts represent 18 additional bylines that increase the total number of bylines in the corpus from 104 to 122. And there are still clearly only 6 collaborative ghostwriters among this group. Most of the added texts fit in the Byrd group, as I had some questions I wanted to answer before beginning the translation of a volume from this group for BRRAM. I posted the updated data tables, with a new bibliography and an updated chronology for the Byrd group on the GitHub page: https://github.com/faktorovich/Attribution I believe some of you have asked about the small size of the Byrd group previously, so now you can see the larger patterns to Byrd's usage - including some of the rare 3-word phrases that only appear in Byrd's writing. Questions are welcomed as usual. There are many other elements this new data explains, as I will detail in the forthcoming Byrd volume.

631faktorovich
Dic 10, 2022, 1:35 pm

As I approach finishing the Byrd volume of BRRAM, I added one more text to the corpus (the primarily Byrd-ghostwritten and "Spenser"-assigned "Shepherds' Calendar") for a total of 303 texts. You can see the updated files on the main page (they are the newest uploads): https://github.com/faktorovich/Attribution I also uploaded an updated version of the handwriting comparison document that now includes handwriting samples from all six ghostwriters and their pseudonyms. I should finish one more volume (Sylvester's) and will publish the remainder or the second half of the BRRAM series in the coming couple of months.

632faktorovich
Feb 13, 2023, 9:51 pm

Este mensaje ha sido denunciado por varios usuarios por lo que no se muestra públicamente. (mostrar)

The Second Half of the British Renaissance Re-Attribution and Modernization Series Published

Learn About the Six Ghostwriters Publishing Monopoly Conspiracy.

Dear LibraryThing Users:

Anaphora Literary Press is excited to offer you the opportunity to review or write a story about the newly released final 6 volumes of the British Renaissance Re-Attribution and Modernization Series (BRRAM): https://anaphoraliterary.com/attribution. This website includes press coverage in papers such as the Times Record News, and Midwest Book reviews. Email me at director@anaphoraliterary.com to request free review copies.

Volumes 15-17 are from “Part II: Attribution Mysteries” of the series that provides evidence to re-attribute the “William Cavendish”-bylined Variety comedy to Benjamin Jonson, the “Samuel Brandon”-bylined Virtuous Octavia tragicomedy to Gabriel Harvey, and variedly-bylined (including pieces from “Shakespeare”, “Raleigh” and “Dyer”) verse in an anthology to William Byrd. And Volumes 18-20 are from “Part III: The Self-Attributed Texts of the Ghosts” that translates and introduces texts that prove these ghostwriters’ style or writerly patterns match those of the other bylines in their groups. Part III covers Richard Verstegan’s first Old English dictionary and Anglo-Saxon mythology-foundation, A Restitution for Decayed Intelligence in Antiquities (1605), Gabriel Harvey’s Latin (and now translated into English for the first time) satire about pseudonyms, Smith: Or, The Tears of the Muses (1578) and Josuah Sylvester’s first and only verse English translation of the Book of Job and the absurd fantasy Woodman’s Bear (1620). Each of these books is designed with questions for discussion, synopses, contextual introductions, and other components to assist teachers and casual readers.

This series is cataloged in the World Shakespeare Bibliography and in the Play Index (EBSCO). A few sections out of BRRAM have been published in scholarly journals. “Manipulation of Theatrical Audience-Size: Nonexistent Plays and Murderous Lenders” was published in Critical Survey, Issue 34.1, Spring 2022. “‘Michael Cavendish’s’ 14 Airs in Tablature to the Lute (1598)” was published in East-West Cultural Passage, Volume 22, Issue 2, December 2022. The Journal of Information Ethics published two articles on Faktorovich’s re-attribution method: “Publishers and Hack Writers: Signs of Collaborative Writing in the ‘Defoe’ Canon” (Fall 2020) and “Falsifications and Fabrications in the Standard Computational-Linguistics Authorial-Attribution Methods: A Comparison of the Methodology in ‘Unmasking’ with the 28-Tests” (Spring 2022). The computational data, handwriting comparisons across bylines, diagrams and other content to assist further research is publicly accessible at: https://github.com/faktorovich/Attribution.

Benjamin Jonson’s The Variety (1649): Volumes 15: Softcover: 224pp, 6X9”, $22: 979-8-375805-12-2; Hardcover: $27: 979-8-375806-34-1; Kindle: $9.99; Overdrive EBook: 978-1-68114-571-6; LCCN: 2023932017; Nonfiction—Drama—European—English, Irish, Scottish, Welsh. Release: February 8, 2023. https://www.amazon.com/dp/B0BV1DDSKW

A fragmentary comedy about the corruption of the judicial and monarchical systems in charge of granting aristocratic titles based on appearance instead of merit.

This comedy includes several devices that are uniquely typical of Jonson’s authorial style, including the extraordinary number of five marriages in the resolution, and the intricate descriptions of the significance of outward appearance (in dance, clothing, makeup and gossip) in distinguishing anybody in Britain as superior or inferior. At the onset of the plot, Sir William is hoping to marry the wealthy-widow, Lady Beaufield, to gain access to her fortune. In parallel, Simpleton’s wealthy-widow Mother is hoping to marry a knight so she can gain the aristocratic title of a Lady. Meanwhile, Simpleton is courting Beaufield’s daughter, Lucy, who clearly favors her other suitor, Newman. Simpleton devises several schemes to win an advantage by hiring jeerers to ridicule Newman, as well as hiring Voluble to give Newman a false prophecy to manipulate him toward whoring and drinking. By the end, Simpleton even attempts to kidnap Lucy to force her into marriage. In the background of these various courtships, the French dance teacher, Galliard, is tutoring his wealthy students in dance. And Voluble and Nice are teaching proper manners, dress and other outward signs of aristocratic breeding in their Female Academy. These seemingly silly and pretty tropes are clouding the fact that Galliard confesses he has escaped being executed for attempting to overthrow the French King in 1632, and Voluble is repeatedly accused of witchcraft. More importantly, the narrative explains the corrupt process that was involved in bribing judges and administrators into allowing a wealthy gentry landowner, like Mother, to purchase her way into the aristocracy through a vacant baronet title. Mother merely has to choose between going through the ladyfying schooling herself, or completely negating her burden by hiring an actress, such as Nice (the chambermaid), to pretend to be her in public appearances. The dialogue refers to several people who were granted aristocratic titles by this corrupt process, starting with the 1st Lord of Lorne of Scotland in 1439, and as late as the Duke of Buckingham in 1623. Many of the contextual references mention the Percys’ Northumberland estate’s Scottish neighbors, as well as other Percy-associated places and people in the Buckingham Palace and Newcastle; thus, this play is likely to have been closeted by Percy until after his death because Jonson was criticizing the Percys’ involvement in these title-purchasing schemes. Percy (as the primary ghostwriter) and Jonson (as the secondary) had written about knighthood-purchasing and James I’s trade in titles to his Scottish and Scottish-adjacent comrades in Eastward Ho! These frank confessions about corruption in the monarchy led to Jonson’s temporary imprisonment in 1605. This volume includes translations of all of Jonson’s authentic letters. These include the letter he wrote in 1605, during this Eastward imprisonment, wherein Jonson asks Percy to help free him from being implicated in seditious remarks that he claims were Percy’s portion of the composition. The annotations across Variety provide a myriad of scholarly revelations, supported with precise evidence. One of these is new proof for the misdating for several antique-like forgeries of broadsheet ballads. Introductory sections explain why this play has been mis-attributed to “William Cavendish”, and the complex biographical overlaps between the Jonson and “John Donne” bylines and handwriting styles. The historical introduction to the types of dance-instructors Variety is satirizing is assisted by the translation from French into English of fragments from Apologie de la Danse or Apology for the Dance by “Par F. de Lauze” (1623).

Gabriel Harvey’s The Tragicomedy of the Virtuous Octavia (1598): Volume 16: Softcover: 250pp, 6X9”, $23: 979-8-375807-88-1; Hardcover: $28: 979-8-375808-09-3; Kindle: $9.99; Overdrive EBook: 978-1-68114-572-3; LCCN: 2023932029; Nonfiction—Drama—European—English, Irish, Scottish, Welsh. Release: February 8, 2023. https://www.amazon.com/dp/B0BV2RRKG7

The first English self-labeled “tragicomedy” about Octavia’s failed attempts to win back her inconstant husband, Antony, from his Egyptian lover, Cleopatra, and to prevent her brother, Octavius, from waging retaliatory war on Antony and Cleopatra.

This volume presents overwhelming evidence for the re-attribution of the “Samuell Brandon”-bylined The Virtuous Octavia (1598) to Gabriel Harvey. The introduction raises questions about potential attribution leads and revealing relevant sources, which are answered with the evidence in the “Primary Sources” section that includes: three letters exchanged between William Byrd and Harvey while both were teaching at Cambridge, the “Octavia to Anthony” poetic epistle from the Arundel Harington Manuscript, and fragments from Plutarch’s “Mark Antony” chapter. The “Exordium” includes sections that present revealing clues in seemingly mundane details, such as this play’s typesetting. Another introductory section explains how Gerard Langbaine created the first “Brandon” biography solely based on the evidence presented in the Virtuous play, and without any evidence to support that “Brandon” was indeed a real author, and not merely a fictitious pseudonym. The imaginative process Langbaine used to manufacture “Brandon’s” biography is used to explain how scholars have communally arrived at the erroneous current attributions for the texts of the British Renaissance. A section on Harvey’s literary style explains how the texts Harvey ghostwrote differ from the patterns seen in the other Workshop ghostwriters’ texts. Another section presents visual examples of Harvey’s handwriting in his signed annotations on Domenichi’s Facetie, on “J. Harvey’s” A Discursive Problem Concerning Prophesies, and on Nicolai Machiavelli Princeps, and matches these to the handwriting styles currently assigned to two bylines Harvey ghostwrote under: “Edmund Spenser’s” poem on a copy of Sabinus’ Poëmata and “Elizabeth I’s” letter in Italian to Don Ferdinando de Medici, Grand-Duke of Tuscany. Another section explains how the two dedications to “the virtuous… Mary Thynne” and “the virtuous Lady Lucia Audley” are subversive clues that explain Virtuous Octavia as Harvey’s rebuttal to Percy’s at first anonymous and later “Shakespeare”-bylined Romeo and Juliet (1597). Romeo’s plot has long been suspected to be grounded in the contemporary story of Mary Thynne’s marriage to a member of a rival family, as well as the subsequent violence and litigations over this star-crossed-marriage between Mary’s mother, Lady Audley, and other members of their two clans. And a section on imitation-clusters explains that Virtuous Octavia falls into several sub-genre clusters that turn into an original formula when they are mixed together. These clusters include imitations and translations of the French dramatist Robert Garnier; adaptations of historical plotlines from Plutarch’s Lives; and imitations of Seneca’s tragedies. One of the latter tragedies by Seneca is also called Octavia, and it is about Emperor Nero’s wife of this same name, which had been translated into English by “T. N.” back in 1581. There are also explanation for the seemingly deliberately misdated historical details, such as the mixed references to events that involved M. Marcellus (270-208 BC; 5-time Consul) and G. Marcellus (88-40 BC; 1-time Consul; first husband of Octavia). And sections summarize Virtuous Octavia’s critical reception, give ideas to directors on approaches to its staging, and present an extensive synopsis of its narrative. This verse tragicomedy begins after the Treaty of Tarentum has been signed, renewing the power-split of Roman territories between three Emperors: Octavia’s brother Octavius is awarded the West, Octavia’s husband Antony is awarded the East, and Lepidus receives Africa. Octavia receives news that Antony is living with Cleopatra. When Octavia attempts to bring military reinforcements and to speak with Antony to convince him to return to her, Antony refuses to allow her to come near him. The news of this infidelity enrages Octavius, who decides it is an affront on his own honor, and uses it as a pretext to wage war against Antony, despite Octavia’s continuing petitions for peace and reconcilement. Civil and foreign wars are raging in the background, but most of the play focuses on Octavia’s philosophical and emotional struggle to comprehend why Antony has chosen to sin, and how she is stoically determined to remain constant and virtuous. In a brief mention in the resolution, Cleopatra causes Antony’s tragic death by tricking him into believing she has killed herself, before indeed killing herself. In the forefront of this conclusion, Octavia explains why she continues to be committed to virtuous conduct, despite all that has happened, and to take care of Antony’s children, even when she has to do so outside of Antony’s house (from which he has forcefully evicted her).

A Comparative Study of Byrd Songs: Volume 17: Softcover: $27, 380pp, 7X10”: 979-8-375810-38-6; Hardcover: $32: 979-8-375810-63-8; Kindle: $9.99; Overdrive EBook: 978-1-68114-573-0; LCCN: 2023932225; Literary Collections—European—English, Irish, Scottish, Welsh. Release: February 8, 2023. https://www.amazon.com/dp/B0BV1KB38Q

A comparative anthology of all of the variedly-bylined texts in William Byrd’s linguistic-group, with scholarly introductions that solve previously impenetrable literary mysteries.

This is a comparative anthology of William Byrd’s multi-bylined verse, with scholarly introductions to their biographies, borrowings, and structural formulas. The tested Byrd-group includes 30 texts with 29 different bylines. Each of these texts is covered in a separate chronologically-organized section. This anthology includes modernized translations of some of the greatest and the wittiest poetry of the Renaissance. Some of these poems are the most famous English poems ever written, while others have never been modernized before. These poems serve merely as a bridge upon which a very different history of early British poetry and music is reconstructed, through the alternative biography of the single ghostwriter behind them. This history begins with two forgeries that are written in an antique Middle English style, while simultaneously imitating Virgil’s Eclogues: “Alexander Barclay’s” claimed translation of Pope Pius II’s Eclogues (1514?) and “John Skelton’s” Eclogues (1521?). The next attribution mystery solved is how only a single poem assigned to “Walter Rawely of the Middle Temple” (when Raleigh is not known to have been a member of this Inn of Court) in The Steal Glass: A Satire (1576) has snowballed into entire anthologies of poetry that continue to be assigned to “Raleigh” as their “author”. Matthew Lownes assigned the “Edmund Spenser”-byline for the first time in 1611 to the previously anonymous Shepherds’ Calendar (1579) to profit from the popularity of the appended to it Fairy Queen. And “Thomas Watson” has been credited with creating Hekatompathia (1582), when this was his first book-length attempt in English; and this collection has been described as the first Petrarchan sonnet sequence in English, when actually most of these poems have 18-line, instead of 14-line stanzas. Byrd’s self-attributed Psalms, Sonnets, and Songs (1588) includes several lyrics that have since been re-assigned erroneously to other bylines in this collection, such as “My Mind to Me a Kingdom Is” being re-assigned to “Sir Edward Dyer”. The Byrd chapter also describes the history of his music-licensing monopoly. The “University Wit” label is reinterpreted as being applied to those who completed paper-degrees with help from ghostwriters, as exemplified in “Robert Greene’s” confession that “his” Pandosto and Menaphon were “so many parricides”, as if these obscene topics were forced upon him without his participation in the authorial process. “Philip Sidney’s” Astrophil and Stella (1591) is showcased as an example of erroneous autobiographical interpretations of minor poetic references; for example, the line “Rich she is” in a sonnet that puns repeatedly on the term “rich”, has been erroneously widely claimed by scholars to prove that Sidney had a prolonged love-interest in “Lady Penelope Devereux Rich”. Similarly, Thomas Lodge’s 1592-3 voyage to South America has been used to claim his special predilection for “sea-studies”, in works such as Phillis (1593), when adoring descriptions of the sea are common across the Byrd-group. Alexander Dyce appears to have assigned the anonymous Licia (1593) to “Giles Fletcher” in a brief note in 1843, using only the evidence of a vague mention of an associated monarch in a text from another member of the “Fletcher” family. One of the few blatantly fictitiously-bylined Renaissance texts that have not been re-assigned to a famous “Author” is “Henry Willobie’s” Avisa (1594) that invents a non-existent Oxford-affiliated editor called “Hadrian Dorrell”, who confesses to have stolen this book, without “Willobie’s” permission. Even with such blatant evidence of satirical pseudonym usage or potential identity-fraud, scholars have continued to search for names in Oxford’s records that match these bylines. “John Monday’s” Songs and Psalms (1594) has been labeled as one of the earliest madrigal collections. 1594 was the approximate year when Byrd began specializing in providing ghostwriting services for mostly university-educated musicologists, who used these publishing credits to obtain music positions at churches such as the Westminster Abbey, or at Court. An Oxford paper-degree helped “Thomas Morley” become basically the first non-priest Gospeller at the Chapel Royal. The section on “Morley’s” Ballets (1595) describes the fiscal challenges Morley encountered when the music-monopoly temporarily transitioned from Byrd’s direct control to his. “John Dowland’s” First Book of Songs or Airs (1597) is explained as a tool that helped Dowland obtain an absurdly high 500 daler salary from King Christian IV of Denmark in 1600, and his subsequent equally absurd willingness to settle for a £21 salary in 1612 to become King James I’s Lutenist. And the seemingly innocuous publication of “Michael Cavendish’s” 14 Airs in Tablature to the Lute (1598) is reinterpreted, with previously neglected evidence, as actually a book that was more likely to have been published in 1609, as part of the propaganda campaign supporting Lady Arabella Stuart’s succession to the British throne; the attempt failed and led to Arabella’s death during a hunger-strike in the Tower, and to the closeting of Airs. “William Shakespeare’s” The Passionate Pilgrim (1599) has been dismissed by scholars as only containing a few firmly “Shakespearean” poems, in part because nearly all of its 20 poems had appeared under other bylines. Passionate’s poems 16, 17, 19 and 20 are included, with an explanation of the divergent—“Ignoto”, “Shakespeare” and “Marlowe”—bylines they were instead assigned to in England’s Helicon (1600). Scholars have previously been at a loss as to identity of the “John Bennet” of the Madrigals (1599), and this mystery is solved with the explanation that this byline is referring to Sir John Bennet (1553-1627) whose £20,000 bail, was in part sponsored with a £1,200 donation from Sir William Byrd. “John Farmer’s” First Set of English Madrigals (1599) is reinterpreted as a byline that appears to have helped Farmer continue collecting on his Organist salary physically appearing for work, between a notice of absenteeism in 1597 and 1608, when the next Organist was hired. “Thomas Weelkes’” Madrigals (1600) is reframed as part of a fraud that managed to advance Weelkes from a menial laborer £2 salary at Winchester to a £15 Organist salary at Chichester. He was hired at Chichester after somehow finding around £30 to attain an Oxford BA in Music in 1602, in a suspicious parallel with the Dean William Thorne of Chichester’s degree-completion from the same school; this climb was followed by one of the most notorious Organist tenures, as Weelkes was repeatedly cited for being an absentee drunkard, and yet Dean Thorne never fired him. “Richard Carlton’s” Madrigals (1601) also appears to be an inoffensive book, before the unnoticed by scholars “Mus 1291/A” is explained as torn-out prefacing pages that had initially puffed two schemers that were involved in the conspiracy of Biron in 1602. The British Library describes Hand D in “Addition IIc” of Sir Thomas More as “Shakespeare’s only surviving literary manuscript”; this section explains Byrd’s authorship of verse fragments, such as “Addition III”, and Percy’s authorship of the overall majority of this censored play; the various handwritings and linguistic styles in the More manuscript are fully explained. “Michael Drayton’s” Idea (1603-1619) series has been explained as depicting an autobiographical life-long obsession with the unnamed-in-the-text “Anne Goodere”, despite “Drayton’s” apparent split-interest also in a woman called Matilda (1594) and in male lovers in some sprinkled male-pronoun sonnets. “Michael East’s” Second Set of Madrigals (1606) is one of a few music books that credit “Sir Christopher Hatton” as a semi-author due to their authorship at his Ely estate; the many implications of these references are explored. “Thomas Ford’s” Music of Sundry Kinds (1607) serves as a gateway to discuss a group of interrelated Jewish Court musicians, included Joseph Lupo (a potential, though impossible to test, ghostwriter behind the Byrd-group), and open cases of identity-fraud, such as Ford being paid not only his own salary but also £40 for the deceased “John Ballard”. “William Shakespeare’s” Sonnets (1609) are discussed as one of Byrd’s mathematical experiments, which blatantly do not adhere to a single “English sonnet” formula, as they include deviations such as poems with 15 lines, six couplets, and a double-rhyme-schemes. The poems that have been erroneously assigned to “Robert Devereux” are explained as propaganda to puff his activities as a courtier, when he was actually England’s top profiteer from selling over £70,000 in patronage, knighthoods and various other paper-honors. “Orlando Gibbons’” or “Sir Christopher Hatton’s” First Set of Madrigals and Motets (1612) describes the lawsuit over William Byrd taking over a Cambridge band-leading role previously held by William Gibbons, who in retaliated by beating up Byrd and breaking his instrument. This dispute contributed to Byrd and Harvey’s departure from Cambridge. Byrd’s peaceful life in academia appears to be the period that Byrd was thinking back to in 1612, as he was reflecting on his approaching death in the elegantly tragic “Gibbons’” First songs.

Richard Verstegan’s A Restitution for Decayed Intelligence in Antiquities (1605): Volume 18: Softcover: 506pp, 7X10, $31: 979-8-375813-50-9; Hardcover: $36: 979-8-375814-03-2; Kindle: $9.99; Overdrive EBook: 978-1-68114-574-7; LCCN: 2023932018; Nonfiction—History—Europe—Great Britain. Release: February 8, 2023. https://www.amazon.com/dp/B0BV3VB35X

The launch of Britain’s “Anglo-Saxon” origin-myth and the first Old English etymological dictionary.

This is the only book in human history that presents a confessional description of criminal forgery that fraudulently introduced the legendary version of British history that continues to be repeated in modern textbooks. Richard Verstegan was the dominant artist and publisher in the British Ghostwriting Workshop that monopolized the print industry across a century. Scholars have previously described him as a professional goldsmith and exiled Catholic-propaganda publisher, but these qualifications merely prepared him to become a history forger and multi-sided theopolitical manipulator. The BRRAM series’ computational-linguistic method attributes most of the British Renaissance’s theological output, including the translation of the King James Bible, to Verstegan as its ghostwriter. Beyond providing handwriting analysis and documentary proof that Verstegan was the ghostwriter behind various otherwise bylined history-changing texts, this translation of Verstegan’s self-attributed Restitution presents an accessible version of a book that is essential to understanding the path history took to our modern world. On the surface, Restitution is the first dictionary of Old English, and has been credited as the text that established Verstegan as the founder of “Anglo-Saxon” studies. The “Exordium” reveals a much deeper significance behind these firsts by juxtaposing them against Verstegan’s letters and the history of the publication of the earliest Old English texts to be printed starting in 1565 (at the same time when Verstegan began his studies at Oxford). Verstegan is reinterpreted as the dominant forger and (self)-translator of these frequently non-existent manuscripts, whereas credit for these Old English translations has been erroneously assigned to puffed bylines such as Archbishop Parker and the Learned Camden’s Society of Antiquaries. When Verstegan’s motives are overlayed on this history, the term “Anglo-Saxon” is clarified as part of a Dutch-German propaganda campaign that aimed to overpower Britain by suggesting it was historically an Old German-speaking extension of Germany’s Catholic Holy Roman Empire. These ideas regarding a “pure” German race began with the myth of a European unified origin-myth, with their ancestry stemming from Tuisco, shortly after the biblical fall of Babel; Tuisco is described variedly as a tribal founder or as an idolatrous god on whom the term Teutonic is based. This chosen-people European origin-myth was used across the colonial era to convince colonized people of the superiority of their colonizers. A variant of this myth has also been reused in the “Aryan” pure-race theory; the term Aryan is derived from Iran; according to the theology Verstegan explains, this “pure” Germanic race originated with Tuisco’s exit from Babel in Mesopotamia or modern-day Iraq, but since Schlegel’s Über (1808) introduced the term “Aryan”, this theory’s key-term has been erroneously referring to modern-day Iran in Persia. Since Restitution founded these problematic “Anglo-Saxon” ideas, the lack of any earlier translation of it into Modern English has been preventing scholars from understanding the range of deliberate absurdities, contradictions and historical manipulations behind this text. And the Germanic theological legend that Verstegan imagines about Old German deities such as Thor (Zeus: thunder), Friga (Venus: love) and Seater (Saturn) is explained as part of an ancient attempt by empires to demonize colonized cultures, when in fact references to these deities were merely variants of the Greco-Roman deities’ names that resulted from a degradation of Vulgar Latin into early European languages. Translations of the earlier brief versions of these legends from Saxo (1534; 1234?), John the Great (1554) and Olaus the Great (1555) shows how each subsequent “history” adds new and contradictory fictitious details, while claiming the existence of the preceding sources proves their veracity. This study also questions the underlying timeline of British history, proposing instead that DNA evidence for modern-Britons indicates most of them were Dutch-Germans who migrated during Emperor Otto I’s reign (962-973) when Germany first gained control over the Holy Roman Empire, and not in 477, as the legend of Hengist and Horsa (as Verstegan satirically explains, both of these names mean horse) dictates. The history of the origin of Celtic languages (such as Welsh) is also undermined with the alternative theory that they originated in Brittany on France’s border, as opposed to the current belief that British Celts brought the Celtic Breton language into French Brittany when they invaded it in the 9th century. There are many other discoveries across the introductory and annotative content accompanying this translation to stimulate further research.

Gabriel Harvey’s Smith: Or, The Tears of the Muses (1578): Volume 19: Softcover: 250pp, 7X10”, $24: 979-8-375816-71-5; Hardcover: $29: 979-8-375817-17-0; Kindle: $9.99; Overdrive EBook: 978-1-68114-575-4; LCCN: 2023932026; Nonfiction—Biography & Autobiography—Criminals & Outlaws. Release: February 8, 2023. https://www.amazon.com/dp/B0BV1DDMNN

A poetic satire of ghostwriters being hired to write puffery of and by patrons and sponsors, who pay to gain immortal fame for being “great”, while failing to perform any work to deserve any praise.

This volume shows the similarities across Gabriel Harvey’s poetic canon stretching from his critically-ignored self-attributed Smith (1578), his famous “Edmund Spenser”-bylined Fairy Queen (1590), and his semi-recognized “Samuel Brandon”-bylined Virtuous Octavia (1598). This close analysis of Smith is essential for explaining all of Harvey’s multi-bylined output because Smith is an extensive confession about Harvey’s ghostwriting process. Harvey’s Fairy Queen is his mature attempt at an extensive puffery of a monarch, which has been (as Harvey predicted in Smith and Ciceronianus) in return over-puffed as a “great” literary achievement by monarchy-conserving literary scholars across the past four hundred years. The relatively superior in its condensed social message and literary achievement Smith has been ignored in part because the subject of its puffery appears trivial from the perspective of national propaganda. Smith: Or, The Tears of the Muses is a metered poetic composition that can also be performed as a multi-monologue play. The central formulaic structure is grounded in nine Cantos that are delivered by each of the nine Muses; this formula appeared in many British poems and interludes after its appearance in “Nicholas Grimald’s” translation of a “Virgil”-assigned poem called “The Muses” in Songs and Sonnets (1557). The repetitive nature of this puffing formula is subverted not only by the satirical and ironic contradictions that are mixed with the standard exaggerated flatteries of “Sir Thomas Smith” (Elizabeth’s Secretary), but also with several seemingly digressive sections that puff and satirize other bylines, including “Walter Mildmay” (King’s Councilor) and “John Wood” (“Smith’s” copyist and nephew). The central subject of the satire in Smith is Richard Verstegan’s career as a goldsmith, who forged antiques, and committed identity fraud that included ghostwriting books under multiple bylines, including passing himself (as Harvey points out) as at least two different “Sir Thomas Smiths”. The introduction to this volume includes matching handwritten letters that were written by Smith #1 (who died in 1577) and Smith #2 (who died in 1625) and by Verstegan under his own byline. In Smith’s conclusion, Verstegan responds with ridicule of his own directed at Harvey. This is the first full translation of Smith from Latin into English. The accompanying introductory matter, extensive annotations, and class exercises hint at the many scholarly discoveries attainable by researchers who continue the exploration of this elegant work.

Josuah Sylvester’s Job Triumphant in His Trial and The Woodman’s Bear (1620): Volume 20: Softcover: 202pp, 6X9”, $22: 979-8-375822-54-9; Hardcover: $27: 979-8-375822-67-9; Kindle: $9.99; Overdrive EBook: 978-1-68114-576-1; LCCN: 2023932025; Nonfiction—Bibles—Other Translations—Text. Release: February 10, 2023. https://www.amazon.com/dp/B0BVCXJHFH

The first verse English translation of the Book of Job, and a fantasy epic poem about the woeful love between the Woodman and the Bear.

Computational, handwriting, and other types of evidence proves that Josuah Sylvester ghostwrote famous dramas and poetry, including the first “William Shakespeare”-bylined book Venus and Adonis (1593), the “Robert Greene”-bylined Orlando Furioso (1594) and the two “Mary Sidney”-assigned translations of Antonie (1592) and Clorinda (1595). Sylvester is also the ghostwriter behind famously puzzling attribution mysteries, such as the authorship of the anonymous “Shakespeare”-apocrypha Locrine (1595), and behind controversial productions such as the “Cyril Tourneur”-bylined Atheist’s Tragedy (1611). All of the famous texts that Sylvester ghostwrote have previously been modernized and annotated. In contrast, most of Sylvester’s many volumes of self-attributed works have remained unmodernized and thus inaccessible to modern scholars. This neglect is unwarranted since under his own name, Sylvester served as the Poet Laureate between 1606-12 under James I’s eldest son, Henry Frederick, Prince of Wales. This volume addresses this scholarly gap by translating two works that capture Sylvester’s central authorial tendencies. As “John Vicars’” poetic biography argues, Sylvester was a “Christian-Israelite” or a Jew who converted to Christianity, which caused his exile from his native England and his early death abroad. Sylvester’s passion for his Jewish heritage is blatant in the percentage of texts in his group that are based on books in the Old Testament, including the “George Peele”-bylined Love of King David (1599) and the “R. V.”-bylined Odes in Imitation of the Seven Penitential Psalms (1601). This volume presents the first Modern English translation of the only verse Early Modern English translation of the Book of Job. The original Hebrew version’s dialogue is in verse, so that it can be sung or recited during services, and yet there still have not been any scholarly attempts to translate the Old Testament, from versions such as the Verstegan and Harvey-ghostwritten King James Bible, into verse to better approximate this original lyrical structure. Sylvester precisely translates all of the lines and chapters of Job, adding detailed embellishments for dramatic tension and realism. In the narrative, God is challenged by Lucifer to test if Job would remain loyal to God even if he lost his wealth and other blessings; God accepts the challenge and deprives Job of all of his possessions, his family, as well as his health. Job is devastated, but he remains humble and continues to have faith in God. Job’s faith is further challenged by extensive lectures from his friends, who accuse him of suffering because God has judged him to be sinful and in need of punishment. Sylvester also specialized in dreamlike rewriting and remixing of myths from different cultures, as he does in Orlando Furioso, where the narrative leaps between Africa and India, and warfare leads Orlando to go insane. The title-page of Sylvester’s Woodman’s Bear warns readers of a similar trajectory with the epithet: “everybody goes mad once”. In this epic, Greco-Roman-inspired, mythological rewriting, a Woodman has proven to be uniquely resistant to Cupid’s love-arrows, so Cupid disguises himself in a Bear and makes both the Bear and the Woodman fall into desperate love for each other, out of which the Woodman only escape with a magic potion. Woodman’s Bear has been broadly claimed to have been Sylvester’s autobiographical account of a failed courtship, but the analysis across this volume reaches different conclusions and raises ideas for further inquiry.

Your opinion is greatly appreciated, so please contact me (here or via email) with any questions or comments.

Thank you for your time,

Anna Faktorovich, Ph.D., Director
Anaphora Literary Press
https://anaphoraliterary.com
director@anaphoraliterary.com

633Crypto-Willobie
Feb 14, 2023, 9:48 pm

633

634faktorovich
Editado: Feb 21, 2023, 3:54 pm

>633 Crypto-Willobie: You have no response to the release of six additional volumes in BRRAM, with their added overwhelming evidence in support of my computational-linguistics findings? Instead you just want to type a number to increase your post-count? And you guys have chosen to block the preceding post where I describe the new discoveries I have made?

635Crypto-Willobie
Feb 22, 2023, 5:35 pm

i had nothing to do with flagging yer post.

636paradoxosalpha
Feb 22, 2023, 5:45 pm

>635 Crypto-Willobie:

The Stratfordian "Will" in your handle is a giveaway. You're obviously part of the conspiracy of "us guys."

637Petroglyph
Feb 22, 2023, 6:33 pm

>635 Crypto-Willobie:
>636 paradoxosalpha:

There's more layers! It's no coincidence that "Crypto-Willobie" is a pun for "hidden penis". Clearly an in-joke and a Shakespearean meta-reference to those in the know.

638amanda4242
Editado: Feb 22, 2023, 6:55 pm

>632 faktorovich: LT has a very strict policy regarding self-promotion, which is why this post was flagged. As this thread was created to discuss an interview about your ~~conspiracy theories~~ series, you could probably make a case to staff that the flags should be removed. You can email LT staff at info@librarything.com; do be sure to include a link to the message itself.

>636 paradoxosalpha: & >637 Petroglyph: You are both being ridiculous. "Crypto-Willobie" is clearly a veiled reference to "Willoughby," the cad in Sense and Sensibility. Obviously Crypto-Willobie is the secret author of the works of Jane Austen!

639lilithcat
Feb 22, 2023, 7:20 pm

>638 amanda4242:

Nonsense. It's a clear reference to the Willow Song from Othello, thus proving that Crypto-Willobie is William Shakespeare.

640amanda4242
Editado: Feb 22, 2023, 7:35 pm

>639 lilithcat: Balderdash! It is a variant of Willoughby, which the author also used in their other pseudonymous work, The Wolves of Willoughby Chase. I can prove this mathematically once I find my slide rule and 20-sided die.

641Crypto-Willobie
Feb 23, 2023, 1:09 am

Fascinating!

642MrAndrew
Feb 23, 2023, 4:17 am

>640 amanda4242: Horsefeathers! You're missing the critical clue: "Crypto" is clearly a reference to the crypt of "William Shakespeare" at Stratford Upon Avon (which of course contains merely a pair of gag Groucho Marx glasses), the logical reverse of which is "William Shakespeare's" purported birthplace in a half-timbered Tudor-style house, leading us inexorably to the tv series The Tudors, being based on Henry VIII, whose court was graced by Katherine Brandon, Duchess of Suffolk; also known as Baroness Willoughby de Eresby.

That's right, Crypto-Willobie is actually a sixteenth-century English Duchess and the true ghost-writer of Verstegan's entire catalogue, regardless of spurious attribution.

643faktorovich
Feb 23, 2023, 3:03 pm

>636 paradoxosalpha: Your retorts are absurd, but I choose to use them as an opportunity to share some free excerpts out of the second half of BRRAM (these 6 volumes will be available in the Early Reviewers giveaway that starts in March) to enlighten the public. I explain the curious use of "Will" in "A Comparative Study of Byrd Songs: Volume 17", when translating the following two poems out of William Byrd's "Shakespeare"-bylined "Sonnets" (1609).

135

(1) Whoever has her wish—thou have thy Will,
And Will as well, and Will in over-plus.
I am more than enough, who vexes thee still,
To thy sweet will making addition thus.
Will thou, whose will is large and spacious,
Not once vouchsafe to hide my will in thine?
Shall will in others seem rightly gracious,
And in my will no fair acceptance shine?
The sea is all water, yet receives rain still,
And, in abundance, adds to his store.
So, thou, being rich in Will, add to thy Will
One will of mine to make thy large Will more.
Let no unkind “No” fair beseechers kill;
Think all but one, and me in that one Will.(2)

(1) Line 1 echoes Two Gentlemen of Verona’s (1623: Jonson) Act I, Scene 3, lines 61-3: “As one relying on your lordship’s will/ And not depending on his friendly wish—/ My will is something sorted with his wish.” And there is a similar pun in “Lily’s” (Percy) Mydas’ (1592) Act IV, Scene 3: “The world will grow full of wiles, seeing Mydas has lost his golden wish.”

(2) This poem is a satire about the Workshop ghostwriters’ names. The two Wills among them were William Byrd and William Percy. The “rich” among them was Richard Verstegan. Over the decades, the six ghostwriters worked in varied pairings, or triples, and rarely in isolation. For example, in Letters Between Edmund Spenser and Harvey (1580), Harvey complains: “the Birde will not sing in April, nor in May, may perhaps sing in September… If I could steal only one poor fortnight, to peruse him over afresh, and copy him out anew.” Harvey is referring to Will Byrd’s function in the Workshop as their versifier, or the member responsible for turning loose verse into properly metered and rhymed lines. When Byrd was not available or busy with other tasks, poetic projects had to be suspended, or left in prose or in loose verse. Jonson appears to have also helped with versifying, but he primarily only partnered with Percy, and occasionally helped Byrd. The decrease in the percentage of metered verse in “Shakespeare”-bylined plays after around 1601 indicates that Byrd was saying “No” to solicitations to versify these plays, and Jonson was more interested in writing his own plays. Meanwhile, Rich Verstegan continued to be technically in exile from England in 1609, even as his byline began again appearing in King James’ official printer’s publications; so Rich was in need of hiding himself in estates such as Will Percy’s. The absurdity of the Workshop’s juggling act between the Wills, Rich, and their extremely heavy ghostwriting workload is the subject of this broadly misunderstood poem. Walsh notes that some editors have speculated that there are references “to three Wills—the poet, the friend, and the lady’s husband. Some editors reject the husband, and Mr. Lee (as elsewhere) even the friend.” The female (who might not have been a “Lady”) is only mentioned in the opening line, and she has her “Wish”, not a Will. The Poet, William Byrd, is indeed one of the Wills. The “you” (who might or might not be a “friend”) is not Will himself, but has his 3 Wills (Will Percy: who was probably so overbearing he seemed like three people), and also is Rich (Richard Verstegan = you), and is trying to add more (by hiring Will Byrd). A similar play on the name Will appeared in one of Percy’s early comedic plays, the “R. W.”-bylined Three Lords and Three Ladies of London (1590), which features three young pages, who are called Will, Wit and Wealth, who wittily discuss the meaning of their names, as an elderly merchant, called Simplicity, attempts to seduce them or take one of them home with him; the meaning of this exchange is discussed in this volume of the BRRAM series.

136 (3)

If thy soul checks thee, when I come so near;
Swear to thy blind soul that I was thy Will; (4)
And thy soul knows will is admitted there.
Thus far, for love, my love-suit sweetly fulfill.
Will will fulfill the treasure of thy love;
I will it full with wills, and my will is one;
In things of great reception with ease we prove;
Among a number, one is reckoned as none.
Then, in the number let me pass untold;
Though in thy store’s account I one must be;
Since nothing holds me, so it pleases thee to hold;
What is nothing to me, is something sweet to thee.
Make just my name thy love, and love that still;
And then, thou love me; since my name is Will.

(3) This sonnet is spoken by the capital-letter Will, so it seems to be presented as Will’s unsolicited response to non-capitalized will’s query in Sonnet 135. Whereas Sonnet 135 could be interpreted as a discussion of splitting ghostwriting duties; Sonnet 136 sharpens or redirects the focus onto the sexual or friendly loving relationships between the parties.

(4) Beyond 135-6, there are other poems in Sonnets that pun on the name Will, including Sonnet 143 that also uses a capitalized and italicized form. A capitalized, but non-italicized Will pun also appears in Sonnet 57: “So true a fool is love that in your Will…”. A non-capitalized will pun also appears in Sonnet 89, where will is repeated a few times, but is not personified. A non-capitalized and semi-personified version appears in Sonnet 134: “And I myself am mortgaged to thy will”. And there is a pun on Will in Harvey’s “Shakespeare”-bylined Lucrece’s lines 495-7: “But Will is deaf and hears no headful friends;/ Only he has an eye to gaze on Beauty,/ And dotes on what he looks, against law or duty.”

644faktorovich
Feb 23, 2023, 3:29 pm

>638 amanda4242: The "Crypto-Willobie" byline must be referencing "Henry Willobie's" "Avisa" (1594: Byrd). So, here are excerpts from Volume 17 related to the meaning of this byline:

One of the few blatantly fictitiously-bylined Renaissance texts that have not been re-assigned to a famous “Author” is “Henry Willobie’s” Avisa (1594) that invents a non-existent Oxford-affiliated editor called “Hadrian Dorrell”, who confesses to have stolen this book, without “Willobie’s” permission. Even with such blatant evidence of satirical pseudonym usage or potential identity-fraud, scholars have continued to search for names in Oxford’s records that match these bylines...
Any need for permission from the author was nullified by 1596, when “Dorrell” claimed Willobie had died prior to this point. No birth or death records appear to have survived, and the firmest evidence for “Henry Willobie’s” existence is the entry for the completion of an Oxford BA degree by a “Henry Willoughbie” who first enrolled at the age of 16 in 1591. This particular Henry’s age has led critics such as Robert Prechter to conclude: “tagging a teenager as the author does not fit the sophistication of Avisa”. However, “Dorrell’s” 1596 introduction further indicates that Avisa was written “at least… thirty-five years since”, which would exclude “Willoughbie”, as the authorship would have taken place in around 1561, or long before his birth.
Yet another attribution problem is the blatantly fictitious pseudonyms of the two prefacers. “Abell Emet” might mean in Latin: Abel (son of Adam and Eve) will purchase. Alternatively, “Thomas Abell” was credited as the author of Invicta Veritas: An Answer, That by no manner of law, it may be lawful for the most noble King of England, King Henry the Eight to be divorced from the Queen’s grace, his lawful wife (1532); so the name might be designed to hint at “Abell’s” authorship. And “Contraria Contrariis” means in Latin: Contraries to Contraries. The use of absurd pseudonyms was a standard practice of the Workshop. The Workshop’s standard practice was to create absurd pseudonyms that were funny to insiders who understood advanced linguistics, and then to puff these pseudonyms as if they were real names for real “authors”. Instead, in this case, the absurdity of these pseudonyms was heightened by the publication of an exchange of argumentative contemporary articles regarding if “Willobie” was fictitious or real. It began with “Peter Colse’s” two echoing notes in Penelope’s Complaint (1596) that Avisa was “by an unknown Author”. And it continued with a response from “Dorrell” in a new 1596 edition of Avisa, where he insists “Willobie” is the author’s “true name”. This public airing of fraudulent byline usage concluded in the 1599 decision by the Stationers’ Register for Avisa “to be Called in” for destruction. There are no surviving copies of the 1596 edition, as the surviving “1596”-dated “Apology” was published in the 1605 edition. Such recalling is rare in the Byrd-group. This strange repercussion might have been the outcome of Byrd’s music-poetry monopoly expiring in 1596, after 21 years on the patent, before it was transferred to “Morley” and then to “Barley”. Verstegan, Jonson or other rival publishing monopolists might have attempted to use the expiration of this patent to minimize Byrd’s power in the music-poetry genre, in part by questioning the authenticity of bylines such as “Willobie” in the press to threaten a lawsuit over such bylines’ fraudulence. This campaign against Byrd might have resulted in this direct censorship and recalling of Avisa...

645faktorovich
Feb 23, 2023, 3:35 pm

>639 lilithcat: The "Willow Song" is a genre that Percy and Jonson repeated across many of their dramas. Jonson was the Workshop's chief comedian, while Percy was the chief tragedian; and both of them appear to have found dark humor in the repetition of the word "willow" in a lamenting song. A fragment from Percy's self-attributed "Volume 5: Thirsty Arabia" helps to explain.

THE THIRD SONG.

1.

Gentle Lady, receive our song.
And for the same do not make us long.
We have travelled far and near
To be partakers of you here.
Yet plain you are, it does appear:
Willow, willow, willow,
Willow, willow, willow. (1)
Willow and rue will be our cheer.

2.

We are neither Gentiles nor Jews.
At your own good pleasure, you may choose,
Without obstacle or without disease,
Either of the two, if so you please.
Yet plain by you, it does appear:
Willow, willow, willow,
Willow, willow, willow.
Willow and rue will be our cheer.

3.

What need is there for all our wooing?
Full well we know what is in the doing.
If notes may be read by your face,
Both of us find what will be our grace.
Since plain by you, it does appear:
Willow, willow, willow,
Willow, willow, willow.
Willow and rue will be our cheer.

(1) The repetition of “willow” satirizes the popularity of songs about the willow tree in Elizabethan England. W. Chappell includes a section of willow songs in "Old English Popular Music" (1961), 106-110. The Jonson-ghostwritten and “Shakespeare”-bylined "Othello" also includes the tragic or melodramatic version, “O Willow, Willow”, which Percy’s song is satirizing and plagiarizing. Desdemona tells Emilia that her mother’s maid died while singing the traditional “song of willow”. Desdemona then proceeds to recite the song amidst dialogue. It includes a near identical triple repetition of “willow”, but with the added explanatory word “sing”: “Sing willow, willow, willow.” The song proceeds thus: “I called my love false love, but what said he then? Sing willow, willow, willow.” In Act V, Emilia dies while singing “Willow, willow, willow” and insisting that Desdemona was “chaste”.

646faktorovich
Feb 23, 2023, 3:44 pm

>640 amanda4242: There are variant spellings of "Willobie" just as there are variant spellings of most words in Middle/Old English. The Workshop and others during the Renaissance used such spelling variations in names to commit fraud. For example, "Shakespeare's" name is spelled in 6 different ways in each of the 6 surviving signatures. As I explained in the preceding note, there was a “Henry Willoughbie” who enrolled into Oxford in 1591. Thus, Joan Aiken might have read scholarship that noted this diverging spelling, and thus used a still different variant in her title. My 303 texts also include the "John Wilbye"-bylined "First Set of English Madrigals" (1598), which was the only book with "Madrigals" in its title that matched Percy and Jonson as their main ghostwriters, with Byrd as a secondary; the other madrigal collections were all ghostwritten primarily by Byrd. Both "Willobie" and "Wilbye" were likely to indeed be plays on will-bye or by-willow, and thus the explanations regarding the repetition of "will" and "willow" in poems I previously offered is relevant to understanding the puns in these pseudonyms.

647faktorovich
Feb 23, 2023, 4:05 pm

>642 MrAndrew: My article "'Michael Cavendish’s' '14 Airs in Tablature to the Lute' (1598)" was just published in the "East-West Cultural Passage", Volume 22, Issue 2, December 2022. In this article, I explain that this work's dedication to the "Lady Arbella" was part of a campaign to put Arbella on the British throne, which ended in her imprisonment and death on a hunger-strike. Arbella's family (Cavendishes) was closely connected with the Percys, many of whom were also imprisoned (and some assassinated or executed) in the Tower over sedition charges. Both the Cavendishes and Percys were associated with Suffolk, as I explain in this same article: "The table of contents for 'Airs' includes a woodcut of the Cavendishes’ coat-of-arms. Fellowes concludes that this is sufficient proof that authenticates 'Michael' was a member of the aristocratic Cavendish family. In particular, Fellowes cites Devy’s 'Suffolk Collection' manuscript that appears to indicate that 'Michael' was one of the grandsons of George Cavendish in Suffolk, an affiliate of Cardinal Wolsey." The fate of the Suffolk line is further explained in my "Volume 16: Virtuous Octavia", which was initially held the "Samuel Brandon"-byline (and was ghostwritten by Harvey). This "Brandon"-byline seems to have been used to hint it was authored by a member of the Suffolk clan: "The Brandons held the title of Earls or Dukes of Suffolk for a couple of centuries before this title was forfeited and the Brandon line faded away. The event that brought this line to an abrupt conclusion is when Henry Grey was executed for treason by Mary I for attempting to put Lady Jane Grey on the throne in 1554." Thus, if the similarity to "Willoughby's" name is intentionally in the "Henry Willobie"-byline; then, just as in "Octavia", the Workshop wanted to hint at a rebellious aristocratic authorship, without directly accusing any specific aristocrat of writing a potentially seditious or libelous text. And given the ties of these families to William Percy, it seems Byrd wanted to satirize Percy's character in this work.

648Crypto-Willobie
Feb 23, 2023, 5:52 pm

AF is right abt the source of my moniker though i disclaim many of the acretions she cites.

649faktorovich
Feb 23, 2023, 7:59 pm

>648 Crypto-Willobie: This might be the first time any of you agreed with anything I have stated. I hope you will clarify what specific "acretions" you deny and why.

650paradoxosalpha
Feb 24, 2023, 2:37 am

>649 faktorovich: any of you

You are such a charmer, Dr. F. "We guys" continue to be flattered by your treatment of us as a univocal rabble.

651faktorovich
Feb 24, 2023, 3:45 pm

>650 paradoxosalpha: With a name like "Dr. F.", as you prefer to call me, it is only fitting that I am a tough-grader, especially where my own research is concerned. It would have been a shock if, on top of Crypto-Willobie semi-agreeing with me in the preceding note, another one of you broke away from the song of the "univocal rabble" to find another point of agreement. I do hope that some of you will brave the pull of the crowd in the opposite direction, and will actually address the logic of my statements, instead of only searching for points of disagreement, as suitable response topics. Labeling a scholar as being too teacher-like when the presented research is sophisticating, is a low rebuttal. You would have a stronger position to ridicule the "conspiracy" I have uncovered, if you found anything untrue, or un-supported by evidence in my claims. BRRAM explains the evidence that was visible across the past four centuries, but scholars have not dared to or were not able to notice (due to a shortage of research-time). Thus, I am acting as the rebel by bringing the facts into the public view; while you are self-describing yourselves as "univocal rabble", while issuing mutually-echoing ridicule without being able to find any new points of disagreement to voice. I would be delighted if "you guys" broke away from the echo-chamber and brought completely original objections to the completely original research I just cited. As I am here to learn what the public thinks of my findings, and to respond to the public's questions.

652MrAndrew
Feb 25, 2023, 5:33 am

I'm genuinely interested to know what you have learned in your time here, in regard to what the public thinks of your findings.

653faktorovich
Feb 25, 2023, 6:46 pm

>652 MrAndrew: I learned that the public is deeply committed to the current bylines assigned to Renaissance texts, and is extremely resistant to any alternative proposals. Theories regarding alternative "Shakespeares" are acceptable as curiosities, but the idea that there are five different authorial styles behind the texts currently assigned to bylines such as "Shakespeare" are unpalatable. I also learned that Renaissance dramas and poetry have been so engrained in the school curriculum that it is extremely difficult to correct misunderstandings in these fields. This is why I added a translation of the "Book of Job" from the Old Testament, the first Old English dictionary ("Restitution") and other unique-genre texts to the BRRAM series before finishing it, with the hope that seeing stylistic and structural similarities and considering related re-attributions would be easier in these less pre-programmed genres. BRRAM is now finished and published. It is too late to use feedback to add texts to BRRAM or to otherwise adjust its trajectory. It is now important for me to learn how I can best bring it into the public's consciousness, or explain its validity to somebody for whom its ideas might intuitively seem incomprehensible. For example, you guys reacted to my post regarding the publication of the final six volumes by satirizing the term "conspiracy", or by proposing irrational conspiracy theories of your own in an attempt to frame my findings as a similarly fantastic conspiracy theory. I recently read about the conspiracy theory that the term "conspiracy theory" was created by the CIA to prevent the public from countering its official JFK assassination account: https://apnews.com/article/fact-check-conspiracy-theory-jfk-941578119864 As AP News accurately points out, the term "conspiracy theory" was indeed used as far back as 1863; but even back then it was used with a similarly dismissive purpose of minimizing counter-propagandistic versions of history. The article notes "A 1967 CIA dispatch about the Warren Commission — a group of officials appointed by Lyndon B. Johnson to investigate Kennedy’s assassination by Lee Harvey Oswald — refers to 'conspiracy theories' and 'conspiracy theorists.'" While the article points out that such minimizing propagandistic campaigns that used these terms existed prior to the JFK assassination; it cannot dispute that this 1967 dispatch did exist and did specify that the use of these terms could be used to encourage public-doubt in alternative theories. Thus, by reading all of your comments that suggested absurd conspiracies to minimize my real conspiracy, I gained a better understanding of this propagandistic strategy. I hope this clarifies my continuing interest in further discussion on this forum about BRRAM.

654Matke
Feb 28, 2023, 9:36 am

Factorovich: Are you familiar with the work of Professor Loreto of the University of Rome? He’s developed a computer program that analyzes texts and compares them with one another. He doesn’t take into account style or the message of the text; it’s more down to the DNA of the texts. He worked on uncovering the authors of the books by “Elena Ferrante.”

655prosfilaes
Feb 28, 2023, 10:02 am

>653 faktorovich: I learned that the public is deeply committed to the current bylines assigned to Renaissance texts, and is extremely resistant to any alternative proposals.

I doubt that's true. Personally, I'm not well-read in Renaissance authors. It's just "everything you know is wrong" theories tend to be wrong, and the world I know has a huge number of writers, ranging from those who hit it big to those who never get much of anything for their writing, from those who churn out a book a month to those who write one book in their lifetime. Such a radical change from what I understand of the world and from what the experts in the time period claim is highly suspicious to me.

656faktorovich
Feb 28, 2023, 8:32 pm

>654 Matke: Since you've asked, I did some research. JStor does not include Loreto's article on Ferrante. I found a mainstream mention of this research in Literary Hub: https://lithub.com/have-italian-scholars-figured-out-the-identity-of-elena-ferra... "In the study by Margherita Lalli, Francesca Tria, and Vittorio Loreto (Sapienza University of Rome), a data-compression algorithm wrongly attributes Ferrante’s 'Troubling Love to Starnone' and Starnone’s 'Eccesso di zelo, Via Gemito' and 'Prima esecuzione' to Ferrante." The author of this article reaches a biased conclusion that Loreto's re-attribution is "wrong" simply because the conclusion disagrees with the accepted bylines, or with the author's sense of what is wrong or right. This article and you don't explain the nature of Loreto's algorithm. The precise elements this algorithm examines and how it does so is what would determine if the attributions reached with this method are reached "wrongly" or rightly. Given the different attribution methods I have reviewed and my findings regarding the dominance of ghostwriting, I intuitively believe bylines are likely to have nothing to do with who actually writes texts. My intuitive bias is thus leaning in the opposite direction of the article's author, who wants to believe all bylines are true until they are proven otherwise with a direct confession from non-other than the author. This is why it is not a good idea to reach attribution decisions based on intuition or on light research into the general method employed. It is important to apply a combination of tests (like my 27-tests method) to a large corpus with multiple bylines, and then find support for these findings beyond the data. Loreto's algorithm is only the raw data, and he has not attempted to find internal, external, documentary or biographical, etc. additional types of evidence to support what the numbers indicated. Unlike Loreto's study, BRRAM's 20 volumes find overwhelming proof beyond the data, as well as analyzing with the largest quantity of tests and the largest number of texts attempted up-to-now. It would be logical if Loreto or somebody else created an algorithm that incorporated my 27+ tests, but as far as I know such programmers are too busy to respond to queries regarding how their programs should be improved. Feel free to invite him or any other programmer to email me at director@anaphoraliterary.com if he/they wants my help.

657faktorovich
Feb 28, 2023, 8:43 pm

>655 prosfilaes: It is a fact that nearly all "everything you know is wrong" theories must be wrong. However, when there are such theories that are right, they change world-history. Discerning between these two groups is indeed difficult because one must sift through 99.99999% of the wrong theories to find the .000001% that is right about preceding theories being wrong. Since the British Renaissance, there have been 400 years of incorrect attribution theories for this period, and now I have proposed the first correct re-attributions. It is indeed (also) a disturbing thought that, for example, between the three Bronte sisters, there are only two linguistic signatures, thus proving (at minimum) that something about the Bronte authorship biographical narrative is fictitious. It is frightening to imagine that I have spent 12 years in K-12, and then 8 more years in BA-PhD reading British literature by the Great Authors, and this has allowed me to understand the field well enough to learn that most of this corpus is ghostwritten. It basically means that some of the best writers in the world have been forced or manipulated the market by using pseudonyms, and that this has meant greater writers might not have had access to the press and died in obscurity. While all this is disturbing (in contrast with the established historical narrative), it is the truth. And human intellectual progress depends on the knowledge of the truth, as opposed to the joyful continuation of ignorant falsehoods.

658faktorovich
mayo 22, 2023, 11:40 pm

I finished computational-linguistic testing on 310 texts from the British 18th century, and reached some preliminary groupings for these texts: https://github.com/faktorovich/Attribution/blob/master/18th%20Century%20British%... In the next 3 months or so, I will run an experiment on another 300 or so texts from the British 19th century before starting a book to explain the re-attributions for both centuries. There initially appear to be 21 different linguistic signatures in the 18th century, but some of these clusters clearly overlap, so the total number of signatures will shrink as I examine the corpus more closely. There is clearly yet again a small Workshop of ghostwriters who were working across the 18th century in Britain, just like there was a small circle during the Renaissance. Comments or ideas on the data are welcomed here or at director@anaphoraliterary.com, while its interpretation is still a work in progress.

659faktorovich
Ago 5, 2023, 4:22 pm

Preliminary results of my computational-linguistic tests of 18th and 19th century British texts (632 texts in these corpuses), with the top-two most likely underlying 10/11 ghostwriters for each linguistic group highlighted in bold (based on birth, death, publication etc. dates). The next step will be biographical and other types of research to evaluate these quantitative conclusions. Input is welcomed! 18: https://github.com/faktorovich/Attribution/blob/master/18th%20Century%20British%... 19: https://github.com/faktorovich/Attribution/blob/master/19th%20Century%20British%...

660faktorovich
Mar 26, 1:51 pm

I completed the final 18th and 19th century handwriting analysis and re-attribution handbook, which the public can view here: https://drive.google.com/file/d/1_FWao8ZZod8jJbAKch2Y7ngy5VX2EUwt/view?usp=shari... And I have uploaded final stylometric attribution conclusions files, under the titles “18th Century British - Full Summary Table…” and “19th Century British - Publication Information”, to https://github.com/faktorovich/Attribution I am now working on researching documentary and historical evidence to support my conclusions in the second volume of a series, for which Routledge offered a publishing contract. I have presented on my findings at several conferences across this past year, including RMMLA and SAMLA, and parts of this new study were accepted/published in periodicals including SAMLA's journal. All questions and comments about these findings are welcomed here or at director@anaphoraliterary.com

Who Really Wrote the Works of the British Renaissance? thread 2

CharlasTalk about LibraryThing

1amanda4242mayo 3, 2022, 2:20 pm

2amanda4242Editado: mayo 3, 2022, 2:27 pm

Vota: Do you believe this is a fair summary of the previous thread?

3PetroglyphEditado: mayo 3, 2022, 2:55 pm

4amanda4242mayo 3, 2022, 2:56 pm

5lilithcatmayo 3, 2022, 3:01 pm

6paradoxosalphamayo 3, 2022, 3:08 pm

7Petroglyphmayo 3, 2022, 4:11 pm

8abbottthomasmayo 3, 2022, 6:08 pm

92wonderYmayo 3, 2022, 6:26 pm

10faktorovichmayo 3, 2022, 10:09 pm

11Petroglyphmayo 3, 2022, 10:35 pm

12prosfilaesmayo 3, 2022, 10:43 pm

13Keelinemayo 3, 2022, 10:54 pm

14PetroglyphEditado: mayo 4, 2022, 12:57 am

15prosfilaesmayo 3, 2022, 11:32 pm

16Keelinemayo 3, 2022, 11:55 pm

17Petroglyphmayo 4, 2022, 2:35 am

18anglemarkEditado: mayo 4, 2022, 11:24 am

19faktorovichmayo 4, 2022, 10:55 am

20faktorovichmayo 4, 2022, 10:58 am

21faktorovichmayo 4, 2022, 11:05 am

22faktorovichmayo 4, 2022, 11:17 am

23anglemarkEditado: mayo 4, 2022, 11:46 am

24faktorovichmayo 4, 2022, 11:51 am

25faktorovichmayo 4, 2022, 11:54 am

26faktorovichmayo 4, 2022, 11:58 am

27Keelinemayo 4, 2022, 1:25 pm

28Keelinemayo 4, 2022, 1:37 pm

29Petroglyphmayo 4, 2022, 2:11 pm

30Petroglyphmayo 4, 2022, 2:25 pm

31PetroglyphEditado: mayo 4, 2022, 2:48 pm

32Petroglyphmayo 4, 2022, 2:46 pm

33Petroglyphmayo 4, 2022, 5:48 pm

34Petroglyphmayo 4, 2022, 6:52 pm

35prosfilaesmayo 4, 2022, 8:09 pm

36faktorovichmayo 4, 2022, 8:47 pm

37faktorovichmayo 4, 2022, 8:55 pm

38faktorovichmayo 4, 2022, 9:10 pm

39faktorovichmayo 4, 2022, 9:15 pm

40faktorovichmayo 4, 2022, 9:17 pm

41faktorovichmayo 4, 2022, 9:23 pm

42faktorovichmayo 4, 2022, 9:26 pm

43faktorovichmayo 4, 2022, 9:34 pm

44lilithcatmayo 4, 2022, 10:38 pm

45Crypto-Willobiemayo 4, 2022, 10:51 pm

46paradoxosalphaEditado: mayo 5, 2022, 12:21 am

47Petroglyphmayo 4, 2022, 11:34 pm

48Petroglyphmayo 4, 2022, 11:36 pm

49Petroglyphmayo 4, 2022, 11:37 pm

50Petroglyphmayo 4, 2022, 11:39 pm

51PetroglyphEditado: mayo 4, 2022, 11:59 pm

52PetroglyphEditado: mayo 5, 2022, 1:43 am

53Petroglyphmayo 5, 2022, 1:43 am

54faktorovichmayo 5, 2022, 11:33 am

55faktorovichmayo 5, 2022, 11:45 am

56lilithcatmayo 5, 2022, 11:58 am

57Keelinemayo 5, 2022, 12:01 pm

58paradoxosalphaEditado: mayo 5, 2022, 4:46 pm

59faktorovichmayo 5, 2022, 1:02 pm

60faktorovichmayo 5, 2022, 1:06 pm

61faktorovichmayo 5, 2022, 1:16 pm

62coprimemayo 5, 2022, 1:31 pm

63faktorovichmayo 5, 2022, 1:54 pm

64amanda4242mayo 5, 2022, 1:55 pm

65faktorovichmayo 5, 2022, 1:58 pm

66faktorovichmayo 5, 2022, 2:06 pm

67faktorovichmayo 5, 2022, 2:08 pm

68FAMeulsteemayo 5, 2022, 2:13 pm

69faktorovichmayo 5, 2022, 2:15 pm

70amanda4242mayo 5, 2022, 2:17 pm

71faktorovichmayo 5, 2022, 2:19 pm

72coprimemayo 5, 2022, 2:19 pm

73paradoxosalphaEditado: mayo 5, 2022, 4:05 pm

74FAMeulsteemayo 5, 2022, 2:29 pm

75Keelinemayo 5, 2022, 3:11 pm

76amanda4242mayo 5, 2022, 3:19 pm

77Petroglyphmayo 5, 2022, 3:34 pm

1amanda4242
mayo 3, 2022, 2:20 pm

2amanda4242
Editado: mayo 3, 2022, 2:27 pm

3Petroglyph
Editado: mayo 3, 2022, 2:55 pm

4amanda4242
mayo 3, 2022, 2:56 pm

5lilithcat
mayo 3, 2022, 3:01 pm

6paradoxosalpha
mayo 3, 2022, 3:08 pm

7Petroglyph
mayo 3, 2022, 4:11 pm

8abbottthomas
mayo 3, 2022, 6:08 pm

92wonderY
mayo 3, 2022, 6:26 pm

10faktorovich
mayo 3, 2022, 10:09 pm

11Petroglyph
mayo 3, 2022, 10:35 pm

12prosfilaes
mayo 3, 2022, 10:43 pm

13Keeline
mayo 3, 2022, 10:54 pm

14Petroglyph
Editado: mayo 4, 2022, 12:57 am

15prosfilaes
mayo 3, 2022, 11:32 pm

16Keeline
mayo 3, 2022, 11:55 pm

17Petroglyph
mayo 4, 2022, 2:35 am

18anglemark
Editado: mayo 4, 2022, 11:24 am

19faktorovich
mayo 4, 2022, 10:55 am

20faktorovich
mayo 4, 2022, 10:58 am

21faktorovich
mayo 4, 2022, 11:05 am

22faktorovich
mayo 4, 2022, 11:17 am

23anglemark
Editado: mayo 4, 2022, 11:46 am

24faktorovich
mayo 4, 2022, 11:51 am

25faktorovich
mayo 4, 2022, 11:54 am

26faktorovich
mayo 4, 2022, 11:58 am

27Keeline
mayo 4, 2022, 1:25 pm

28Keeline
mayo 4, 2022, 1:37 pm

29Petroglyph
mayo 4, 2022, 2:11 pm

30Petroglyph
mayo 4, 2022, 2:25 pm

31Petroglyph
Editado: mayo 4, 2022, 2:48 pm

32Petroglyph
mayo 4, 2022, 2:46 pm

33Petroglyph
mayo 4, 2022, 5:48 pm

34Petroglyph
mayo 4, 2022, 6:52 pm

35prosfilaes
mayo 4, 2022, 8:09 pm

36faktorovich
mayo 4, 2022, 8:47 pm

37faktorovich
mayo 4, 2022, 8:55 pm

38faktorovich
mayo 4, 2022, 9:10 pm

39faktorovich
mayo 4, 2022, 9:15 pm

40faktorovich
mayo 4, 2022, 9:17 pm

41faktorovich
mayo 4, 2022, 9:23 pm

42faktorovich
mayo 4, 2022, 9:26 pm

43faktorovich
mayo 4, 2022, 9:34 pm

44lilithcat
mayo 4, 2022, 10:38 pm

45Crypto-Willobie
mayo 4, 2022, 10:51 pm

46paradoxosalpha
Editado: mayo 5, 2022, 12:21 am

47Petroglyph
mayo 4, 2022, 11:34 pm

48Petroglyph
mayo 4, 2022, 11:36 pm

49Petroglyph
mayo 4, 2022, 11:37 pm

50Petroglyph
mayo 4, 2022, 11:39 pm

51Petroglyph
Editado: mayo 4, 2022, 11:59 pm

52Petroglyph
Editado: mayo 5, 2022, 1:43 am

53Petroglyph
mayo 5, 2022, 1:43 am

54faktorovich
mayo 5, 2022, 11:33 am

55faktorovich
mayo 5, 2022, 11:45 am

56lilithcat
mayo 5, 2022, 11:58 am

57Keeline
mayo 5, 2022, 12:01 pm

58paradoxosalpha
Editado: mayo 5, 2022, 4:46 pm

59faktorovich
mayo 5, 2022, 1:02 pm

60faktorovich
mayo 5, 2022, 1:06 pm

61faktorovich
mayo 5, 2022, 1:16 pm

62coprime
mayo 5, 2022, 1:31 pm

63faktorovich
mayo 5, 2022, 1:54 pm

64amanda4242
mayo 5, 2022, 1:55 pm

65faktorovich
mayo 5, 2022, 1:58 pm

66faktorovich
mayo 5, 2022, 2:06 pm

67faktorovich
mayo 5, 2022, 2:08 pm

68FAMeulstee
mayo 5, 2022, 2:13 pm

69faktorovich
mayo 5, 2022, 2:15 pm

70amanda4242
mayo 5, 2022, 2:17 pm

71faktorovich
mayo 5, 2022, 2:19 pm

72coprime
mayo 5, 2022, 2:19 pm

73paradoxosalpha
Editado: mayo 5, 2022, 4:05 pm

74FAMeulstee
mayo 5, 2022, 2:29 pm

75Keeline
mayo 5, 2022, 3:11 pm

76amanda4242
mayo 5, 2022, 3:19 pm

77Petroglyph
mayo 5, 2022, 3:34 pm

78Keeline
mayo 5, 2022, 3:35 pm

79anglemark
mayo 5, 2022, 3:40 pm

80Petroglyph
mayo 5, 2022, 3:41 pm