/sci/ - Science & Math » Thread #14183394

You Wouldn't Patent the Sun?

Anonymous Thu 03 Feb 2022 21:04:21 No.14183394

Concept for Algorithmic Identification of Primary Spoken Language of Second-Language Speaker Using Meta-Analysis of Deviations from Proper Usage Both for Algorithmically-Translated Text and Human-Translated Text

Human beings and machines, alike, use algorithms to parse and process information concerning language. Because simple translation algorithms still lack the richness of capacity of a human translator, it is fairly easy to determine when a sample of text is the byproduct of an algorithmic translator, provided the sample size is large enough. In the world of online propaganda, nation-states use a combination of algorithmically-translated text and “expertly” translated texts prepared by humans.

Simply having a copy of the Google Translate translation matrix would enable a programmer to create an algorithm that indicates the likely native language of a speaker when text is algorithmically translated. (It is not clear whether even this level of analysis has been achieved, since nothing came up in a cursory search of Google Patents.) When language is translated by a human, however, one must build one's own system for analyzing the text. That system must take into consideration the common errors made even by skilled translators, including artifacts of language that do not technically constitute errors and that would be perceived as irregular only by a native speaker with great verbal aptitude.

Anonymous Thu 03 Feb 2022 21:04:53 No.14183400

There are dozens if not hundreds of individual parameters this algorithm could take into consideration. Of greatest value are those “errors” that do not technically constitute errors. Most useful toward this end, in this author’s judgment, would be the close analysis of linguistic collocation.

To achieve this, we would need an extremely detailed matrix of word collocations, established by consensus of expert speakers of English (for the purposes of our prototype). Our algorithm could certainly look at other elements of language as well, such as word order, e.g. ‘He was running’ vs. ‘Running was he,’ and gender-specific language. Spanish and even Russian mark grammatical gender; English, like some other languages, largely does not. The accidental incorporation of gender-specific elements in text written by someone masquerading as an English speaker would, of course, serve as a strong red flag.

In the case of algorithmic translations, one interesting red flag to watch for is an entire body of text that translates perfectly into, and then back from, a specific language. One tactic used by adversarial propagandists is to check their translation against the translation matrix itself to see whether it remains the same when translated back into their native language. They tend to use only translations that remain consistent when translated back. This does not mean the translation is correct; in their view, however, it decreases the likelihood of error in the absence of a senior English language “expert.”
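The round-trip red flag above can be sketched in a few lines. This is only a sketch under stated assumptions: the `translate` callable is hypothetical (any function taking text, source language, and target language), and `toy_translate` is a made-up word-for-word stand-in with an invented language code `xx`, not a real translation service.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def round_trip_stable(text: str, lang: str, translate: Translator) -> bool:
    """True if English -> lang -> English reproduces the original text."""
    forward = translate(text, "en", lang)
    back = translate(forward, lang, "en")
    return normalize(back) == normalize(text)


def stable_languages(text: str, candidates: List[str],
                     translate: Translator) -> List[str]:
    """Candidate languages through which the text survives a round trip."""
    return [lang for lang in candidates
            if round_trip_stable(text, lang, translate)]


# Toy stand-in for a real translation engine, for demonstration only.
_TOY: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "xx"): {"big": "grando", "large": "grando", "house": "domo"},
    ("xx", "en"): {"grando": "big", "domo": "house"},
}


def toy_translate(text: str, src: str, dst: str) -> str:
    """Word-for-word lookup; unknown words pass through unchanged."""
    table = _TOY[(src, dst)]
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    for sample in ("big house", "large house"):
        print(sample, stable_languages(sample, ["xx"], toy_translate))
```

A text that survives the round trip for exactly one candidate language is only a weak hint on its own, since short or simple sentences will survive for many languages. It would be one feature among several.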

Anonymous Thu 03 Feb 2022 21:06:01 No.14183403

Understanding which dialect of English an adversary seems to be using can betray useful information concerning their nation of origin. Whereas Russian propagandists and spies study American English, Israeli, Chinese, and Indian spies, just to name a few, study British English. It is important to note, however, that the differences between American and British English are few enough that a skilled linguist could, with relative ease, convince a reader in America that they are British or a reader in England that they are American, especially in writing, where we cannot see with whom we’re speaking.

Algorithmic translations carry significant risk, including the risk that an unknown individual working at Google, for example, could be asked by a governmental entity to deliberately introduce errors into translated text that are specific to other languages in order to help that entity identify algorithmically translated text. It is equally plausible that China and/or the United States are already employing this gimmick through official or unofficial requests to Google employees who have sufficient access.
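Returning to the dialect cue (American vs. British usage), a crude first pass could simply count spelling variants. The word list below is a tiny illustrative sample chosen for this sketch, not a vetted resource, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative sample; a real list would hold thousands of pairs.
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "analyze": "analyse",
    "defense": "defence",
    "traveled": "travelled",
}
UK_WORDS = set(US_TO_UK.values())


def dialect_counts(text: str) -> Counter:
    """Count words that appear in the American or British spelling lists."""
    counts: Counter = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in US_TO_UK:
            counts["american"] += 1
        elif word in UK_WORDS:
            counts["british"] += 1
    return counts


def likely_dialect(text: str) -> str:
    """Return 'american', 'british', or 'undetermined' on a tie."""
    counts = dialect_counts(text)
    if counts["american"] == counts["british"]:
        return "undetermined"
    return "american" if counts["american"] > counts["british"] else "british"


if __name__ == "__main__":
    print(likely_dialect("The colour of the centre was odd."))
    print(likely_dialect("We analyze the color."))
```

Spelling is the easiest dialect marker to fake, so a mixed result (both counts nonzero) may be more interesting than a clean one.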

While generic, run-of-the-mill propaganda can be readily identified through simple observation of the ostensible agenda being promoted and packet analysis of the associated traffic is certainly revelatory in most cases, there are forms of propaganda that are more carefully crafted and not deployed carelessly like buckshot into online forums. Some of these forms of propaganda can make their way into movie and television scripts, news teleprompter scripts, and especially YouTube and other online videos where analysis is not possible without first transcribing the video, which is time-intensive.

Anonymous Thu 03 Feb 2022 21:07:03 No.14183408

In these cases, accents can be faked with great skill and precision. However, the narrators are reading off of a script with instructions not to deviate from that script. These are the cases where it is most critical to know the source of the potential propaganda and where this sort of automated analysis can be best employed.

While I cannot practically list all of the possible collocations of the English language, I will provide some examples of how collocation analysis can be useful for achieving the task of identifying the native language of an individual behind English text:

Word-word collocation:

Whole dictionaries are available that provide common examples of word-word collocations. Even when they avoid malapropisms, a non-native speaker will miss obvious opportunities to collocate words on paper that should already be mentally collocated. Mental collocation is the byproduct of being exposed to a language for decades, and mental collocations tend to lead to physical collocations of the words in text. We take them for granted as native speakers, but careful meta-analysis of text can reveal the failure to collocate words. A non-native speaker may not have enough experience with their second language to use the most apt collocation in each possible situation. For a native speaker, the overlap between their mental collocations and the collocations of words in their writing is generally close to complete.
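One way to approximate the collocation matrix without a panel of experts is to estimate it from a large native-written reference corpus. The sketch below swaps in a standard statistic, pointwise mutual information (PMI) over adjacent word pairs, which is an assumption of this example and not something the proposal specifies. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple

Model = Tuple[Counter, Counter, int]


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def build_model(reference_text: str) -> Model:
    """Unigram counts, adjacent-pair counts, and token total.

    Pairs are counted across sentence boundaries, which is a
    simplification a real system would avoid.
    """
    tokens = tokenize(reference_text)
    return Counter(tokens), Counter(zip(tokens, tokens[1:])), len(tokens)


def pmi(pair: Tuple[str, str], model: Model) -> Optional[float]:
    """PMI of an adjacent pair, or None if the pair was never observed."""
    unigrams, bigrams, total = model
    joint = bigrams[pair]
    if joint == 0:
        return None
    p_joint = joint / max(total - 1, 1)
    p_first = unigrams[pair[0]] / total
    p_second = unigrams[pair[1]] / total
    return math.log2(p_joint / (p_first * p_second))


def unseen_pair_rate(text: str, model: Model) -> float:
    """Fraction of adjacent pairs in `text` absent from the reference."""
    tokens = tokenize(text)
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if model[1][pair] == 0)
    return unseen / len(pairs)


if __name__ == "__main__":
    reference = "Heavy rain fell. Strong wind blew. Heavy rain again."
    model = build_model(reference)
    print(pmi(("heavy", "rain"), model))
    print(unseen_pair_rate("strong rain", model))
```

With a realistic corpus, a high rate of unseen or low-PMI pairs (say "strong rain" where natives write "heavy rain") would be the signal of interest. Which specific pairs are missed could then hint at the writer's first language.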

Anonymous Thu 03 Feb 2022 21:08:04 No.14183412

Run-On Sentences and Pithy Sentences:

Non-native speakers will often, even in the case of human-translated text, keep their written sentences short because they are aware of the potential pitfall of run-on sentences. They will tend to err on the side of writing shorter sentences, even at the risk of coming off as “dumb.” Spies often find it useful to be perceived in this way. Native speakers with a poor grasp of the English language will tend to write run-on sentences, whereas non-natives will tend to “tie off” sentences as early as possible lest they make a mistake (something human translators have in common with users of automated translators) or lest they confuse the algorithm.
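Sentence length is probably the cheapest of these features to measure. A minimal sketch, assuming a naive split on terminal punctuation (which mishandles abbreviations such as "e.g."); the function names are hypothetical.

```python
import re
from statistics import mean, pstdev
from typing import Dict, List


def sentence_lengths(text: str) -> List[int]:
    """Word count of each sentence, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def length_profile(text: str) -> Dict[str, float]:
    """Summary statistics; short, uniform sentences give low mean and stdev."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"sentences": 0, "mean": 0.0, "stdev": 0.0, "max": 0}
    return {
        "sentences": len(lengths),
        "mean": mean(lengths),
        "stdev": pstdev(lengths),
        "max": max(lengths),
    }


if __name__ == "__main__":
    print(length_profile("I came. I saw. I conquered the city!"))
```

A low mean with a low spread would fit the "tie off early" pattern, while a long tail of very long sentences would fit the run-on pattern.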

Spanish and Russian Overuse of Adjectives:

Native Spanish and Russian speakers are renowned for being self-conscious about the possibility of imprecision in language, and so they will pepper their sentences with many near-synonyms, usually “-ing” words, to make sure they are conveying their point clearly, seemingly unsure of how what they’re saying will be perceived. While a native speaker engaged in persuasive writing will often use two synonyms consecutively and will unconsciously err toward alliteration in synonyms, non-native speakers will cluster three or more synonyms in the same sentence but consciously avoid alliterative synonyms, since they understand that not all words that are alliterative or share common roots actually mean the same thing. Where the native speaker tries to convince you they are “smart,” the non-native speaker tries to convince you they are native. This manifests itself as a marked difference in patterns of language usage.
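The synonym-cluster rule above (three or more near-synonyms in one sentence, none sharing an initial letter) can be sketched as follows. The synonym groups are invented for illustration; a real system would presumably draw on a thesaurus, and the function names are hypothetical.

```python
import re
from typing import List

# Hypothetical near-synonym groups, for illustration only.
SYNONYM_GROUPS = [
    {"shocking", "alarming", "disturbing", "troubling", "startling"},
    {"big", "large", "huge", "bulky"},
]


def synonym_clusters(sentence: str) -> List[List[str]]:
    """Per group, the sorted members found in the sentence (two or more)."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    hits = [sorted(words & group) for group in SYNONYM_GROUPS]
    return [hit for hit in hits if len(hit) >= 2]


def is_alliterative(cluster: List[str]) -> bool:
    """True if at least two words in the cluster share a first letter."""
    return len({word[0] for word in cluster}) < len(cluster)


def flag_sentence(sentence: str, min_cluster: int = 3) -> List[List[str]]:
    """Clusters of `min_cluster`+ synonyms with no shared initial letter."""
    return [
        cluster
        for cluster in synonym_clusters(sentence)
        if len(cluster) >= min_cluster and not is_alliterative(cluster)
    ]


if __name__ == "__main__":
    print(flag_sentence("A shocking, alarming, disturbing development."))
    print(flag_sentence("A big, bulky box."))
```

The first sentence is flagged because it stacks three non-alliterative near-synonyms. The second is not, since it has only a two-word alliterative pair, which is the pattern attributed to native persuasive writing.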

Anonymous Thu 03 Feb 2022 21:09:19 No.14183415

Chinese Failure to Use Conjunctions:

Much like Latin, Mandarin tends to use fewer conjunctions than English, and so when a Mandarin speaker writes or speaks in English, they often omit conjunctions entirely because, in their mind, people can still “get the gist” of what they’re saying. More important in that system of language is the sequence in which words or characters are used. Conversely, a native English speaker learning Mandarin would find themselves trying to insert conjunctions where none are needed and would occasionally err in this way.
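Conjunction omission could be measured as a simple rate per word. A minimal sketch: the conjunction list is a deliberately conservative sample chosen for this example (it leaves out ambiguous words such as "for" and "so"), and any threshold for "too few" would have to come from native reference text.

```python
import re

# Conservative sample; ambiguous words like "for" and "so" are left out.
CONJUNCTIONS = {
    "and", "but", "or", "nor", "yet",
    "because", "although", "though", "unless", "whereas",
}


def conjunction_rate(text: str) -> float:
    """Fraction of word tokens that are conjunctions (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in CONJUNCTIONS for word in words) / len(words)


if __name__ == "__main__":
    print(conjunction_rate("I go store buy rice"))
    print(conjunction_rate("I went and bought rice"))
```

On its own the rate says little for short samples, so it would be compared against the rate in native text of the same genre and length.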

Native-typical Malapropism vs. Non-Native Malapropism:

Non-native speakers take courses in common malapropisms of the English language, e.g. “It’s a doggy dog world,” “for all intensive purposes,” and “irregardless,” and will actively avoid misusing these phrases and words. In most cases, they will not even attempt correct usage because they’re taught it’s hazardous territory. (If I had said ‘dangerous territory,’ that would have been an example of non-native collocation. ‘Hazardous’ and ‘dangerous’ may mean the same thing, but in the context of verbal faux pas, ‘hazardous’ is the more frequently used pairing. Some English speakers, when they speak of collocative frequency, use the expression, “It just sounds better this way.” They do not understand that collocative frequency is at the essence of their perception that it sounds better.)
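The malapropism cue works in both directions: native-typical errors suggest a native writer, while a long text with neither the errors nor the standard idioms fits the avoidance pattern described above. A rough sketch using only the three examples from this post; the mapping to the intended phrases and the function names are assumptions of this example.

```python
import re
from typing import Dict, List

# The three malapropisms quoted above, mapped to the intended phrase.
MALAPROPISMS = {
    "doggy dog world": "dog-eat-dog world",
    "for all intensive purposes": "for all intents and purposes",
    "irregardless": "regardless",
}


def flatten(text: str) -> str:
    """Lowercase, keep letters only, and pad with spaces for whole-word matching."""
    return " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "


def contains_phrase(flat_text: str, phrase: str) -> bool:
    """Whole-word phrase match against text prepared by `flatten`."""
    return flatten(phrase) in flat_text


def idiom_profile(text: str) -> Dict[str, List[str]]:
    """Which malapropisms and which standard forms occur in the text."""
    flat = flatten(text)
    return {
        "malapropisms": [m for m in MALAPROPISMS if contains_phrase(flat, m)],
        "standard": [s for s in MALAPROPISMS.values() if contains_phrase(flat, s)],
    }


if __name__ == "__main__":
    print(idiom_profile("It's a doggy dog world, irregardless."))
    print(idiom_profile("For all intents and purposes, it is fine."))
```

The space padding keeps "regardless" from matching inside "irregardless". Both lists coming back empty for a long sample would be the weak non-native hint; it proves nothing for a short one.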

Anonymous Thu 03 Feb 2022 21:10:20 No.14183419

Those are just a few simple examples of the sorts of things such an algorithm could search for to achieve this goal. Where algorithms are relatively poor at understanding the concept of “uncanniness” when it comes to the fabrication of images (as in GANs), they have an opportunity to shine when it comes to characterizing the dynamics of language, since we unwittingly use word collocation almost exclusively to judge the “canniness” or “uncanniness” of someone’s language, i.e. whether they are a native speaker or not.

I believe that specific patterns found in second-language writing samples can conclusively and accurately betray the native language of the speaker, something that has real potential to allow our own counter-propaganda machinations to function more efficiently given the limited number of human language experts available to review material. Such a system would reduce the likelihood of propaganda evading detection regardless of the care taken in its preparation. Given that misattribution has been a recurring issue with enemies and allies alike, I believe such a system would also prove useful for preventing successful misattribution attacks where written taunts are involved. Software capable of achieving this end should be easy to market to the relevant agencies interested in detecting and characterizing foreign-source propaganda.

The Future Is Made in America
03Feb2022
