= “X tried to V”. The result is that X did V.
X forgot to V = “X ought to have Ved, or intended to V”. The result is that X did not V.
X happened to V = “X didn't plan or intend to V”. The result is that X did V.
X avoided V-ing = “X was expected to, or usually did, or ought to V”. The result is that X did not V.
Possibility for transformations: as the schema above shows, the main point to preserve during a transformation is the “result” introduced by the implicative verb; the implicative verb itself can be removed.
Original sentence: Somehow I managed to wrench myself out of the dream, but not into a state of waking; it was like the screen went blank. // Neil Gaiman and Dave McKean interview, Los Angeles, December 1994, originally broadcast on KUCI, 88.9FM.
Presupposition triggered by the implicative verb manage: “I tried to wrench myself out of the dream”.
Transformed sentence: Somehow I wrenched myself out of the dream, but not into a state of waking; it was like the screen went blank.
Thus it is possible to transform a text by removing presuppositions and converting sentences (or the corresponding fragments) into non-presupposed forms. The inverse process also holds: using the same techniques the other way round, we can introduce presuppositions into (parts of) sentences where there were originally none.

2.2. Semantic Representations and Presupposition Resolution

An efficient and elegant way of representing the behavior of presuppositions in a discourse is to treat them as anaphoric expressions [5]: presupposition triggers can be processed in the same way as anaphoric expressions. A detailed algorithm of presuppositional analysis is described in [8]. For the formal description of presupposition interpretation, discourse representation structures (DRSs) are used. The DRS is a concept within Discourse Representation Theory (partly described in [3]), one of the most influential current approaches to the semantics of natural language. DRSs model semantic relationships within the discourse in general, and the mechanisms of presupposition interpretation in particular. A DRS is made up of two types of objects: discourse referents, representing objects introduced in the discourse, and conditions holding descriptive information about these referents. To put it formally, a DRS R is a pair <U(R), Con(R)> where:
- U(R) is a set of reference markers;
- Con(R) is a set of DRS conditions.
If R and R′ are DRSs, then ¬R, R ∨ R′ and R ⇒ R′ are (complex) DRS conditions.
Consider the following example: If Mike buys a new Hummer, he will like his car. The DRS obtained for this sentence is the following (for illustration purposes we describe only the processes related to presupposition interpretation and abstract away other details):
[[x, y: Mike(x), Hummer(y), buy(x,y)] ⇒ [like(x,z) [α: car(z)]]]
where α represents the presupposed information to be resolved. Presuppositions have rich semantic content, capable of introducing new DRSs within bigger ones. We incorporate presuppositional information into a larger DRS and locate it in the web of discourse referents regulated by accessibility constraints.
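The pair <U(R), Con(R)> and the complex conditions can be mirrored directly in code. The following is a minimal sketch of our own (the class and field names are illustrative, not the Boxer output format), encoding the example DRS with the presupposed α-DRS embedded as a condition:

```python
# Minimal, illustrative encoding of a DRS as a pair <U(R), Con(R)>.
# Complex conditions (negation, disjunction, implication) embed whole DRSs,
# which makes the definition recursive. Names here are our own, not Boxer's.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class DRS:
    U: List[str]              # reference markers, e.g. ["x", "y"]
    Con: List["Condition"]    # DRS conditions

@dataclass
class Neg:                    # ¬R
    R: "DRS"

@dataclass
class Disj:                   # R ∨ R′
    R1: "DRS"
    R2: "DRS"

@dataclass
class Impl:                   # R ⇒ R′
    R1: "DRS"
    R2: "DRS"

# An embedded DRS inside Con marks presupposed (alpha) information.
Condition = Union[Tuple, Neg, Disj, Impl, DRS]

# [[x, y: Mike(x), Hummer(y), buy(x,y)] ⇒ [like(x,z) [α: car(z)]]]
alpha = DRS(["z"], [("car", "z")])   # presupposed information to be resolved
antecedent = DRS(["x", "y"],
                 [("Mike", "x"), ("Hummer", "y"), ("buy", "x", "y")])
consequent = DRS([], [("like", "x", "z"), alpha])
drs = DRS([], [Impl(antecedent, consequent)])
```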
The resolution is performed by binding the presupposition “Mike has a car” to the antecedent (“Hummer”) found in the first part of the sentence: [[x, y: Mike (x), Hummer (y), buy (x,y)] ⇒ [like (x,y)]] If we build a DRS for the same sentence not containing the presupposition, but conveying the same meaning, we will have: If Mike buys a Hummer, he will have a car and he will like it. [[x, y: Mike (x), Hummer (y), buy (x,y)] ⇒ [Mike (x), car (z), like (x,z)]]
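As a toy illustration of this binding step (our own sketch, not the Boxer implementation; the bridging "ontology" here is a single hard-coded fact that a Hummer is a car), the presupposed condition car(z) can be bound to the accessible antecedent y as follows:

```python
# Toy resolution of a presupposed condition by binding it to an accessible
# antecedent. Conditions are tuples like ("buy", "x", "y"); the bridging
# "ontology" is one hard-coded fact (a Hummer is a car).

def resolve(antecedent_conds, consequent_conds, alpha):
    """Bind the referent of the presupposed condition `alpha` to an
    antecedent referent of a compatible sort; accommodate if none is found."""
    bridges = {"car": {"car", "Hummer"}}           # toy bridging ontology
    pred, var = alpha
    for cond in antecedent_conds:
        if len(cond) == 2 and cond[0] in bridges.get(pred, {pred}):
            ref = cond[1]                          # antecedent found: bind var
            return [tuple(ref if a == var else a for a in c)
                    for c in consequent_conds]
    return consequent_conds + [alpha]              # no antecedent: accommodate

antecedent = [("Mike", "x"), ("Hummer", "y"), ("buy", "x", "y")]
resolved = resolve(antecedent, [("like", "x", "z")], ("car", "z"))
print(resolved)   # → [('like', 'x', 'y')]
```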
As a result of anaphoric pronoun resolution and of bridging “car” to “Hummer” (using the created ontology), we obtain the same representation as for the sentence containing the presupposed information:
[[x, y: Mike(x), Hummer(y), buy(x,y)] ⇒ [like(x,y)]]
So, as we can see, the final semantic representations of the sentences with and without the presupposition are the same, since all the discourse referents and the relations between them are preserved after the transformation. Thus, the marked sentences will be those containing presupposition triggers captured by parsing, i.e. those whose DRSs contain (an) embedded DRS(s) with presupposed information to be resolved. During text analysis, logically transparent semantic representations (DRSs) of single sentences are created. If a sentence is part of a more extended discourse, the named operations integrate its DRS into the interpretation of the whole preceding discourse. This is an accumulative process, in which new discourse referents, and the conditions holding for them, are incorporated into the main DRS representing the contents of the whole discourse. For the syntactic and semantic analysis of the text, including presupposition trigger detection and the processing of presupposed information, we use the CCG (Combinatory Categorial Grammar) parser [2], which has wide coverage and hence provides robust parsing. The parser uses Boxer [1] as an add-on software package to generate semantic representations and convert the CCG derivations into DRSs.

3. ALGORITHM OF AN EXTENDED DISCOURSE WATERMARKING

3.1. Algorithm

By means of the remarkable properties of presuppositions and the constructions triggering them, we can easily transform any sentence in a text. We define three principal possibilities for sentence transformation:
- make presupposed information explicit, i.e. get rid of presupposition triggers;
- replace a presupposition trigger with another one;
- introduce presupposed information where there was none.
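A crude illustration of the first stage, finding sentences that contain triggers, might look as follows. The five-item trigger list is invented for the example; the method itself relies on a fixed inventory of 100 triggers and a full CCG parse rather than surface string matching:

```python
import re

# Invented mini sample of surface trigger forms; the real system uses a
# 100-item inventory and CCG parsing, not plain string matching.
TRIGGERS = ["managed to", "found out", "stopped", "again", "the"]

def mark_sentences(sentences):
    """Return (sentence number, matched triggers) for markable sentences."""
    marked = []
    for i, s in enumerate(sentences, start=1):
        hits = [t for t in TRIGGERS
                if re.search(r"\b" + re.escape(t) + r"\b", s, re.IGNORECASE)]
        if hits:
            marked.append((i, hits))
    return marked

sents = ["Somehow I managed to wrench myself out of the dream.",
         "It was like a screen went blank."]
print(mark_sentences(sents))   # → [(1, ['managed to', 'the'])]
```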
Given a file with a text to be watermarked, our task is to modify the text while preserving its semantic content, and to create a key allowing us to perform the inverse transformation and thus confirm the originality of the text (and prove its ownership) or, otherwise, prove that the text has been subjected to unauthorized modifications. Let us illustrate our algorithmic procedure with the following real-life example.
A: What kinds of things can people do to try to expand and reclaim democracy and the public space from corporations?
B: Well, the first thing they have to do is find out what's happening to them. So if you have none of that information, you can't do much. For example, it's impossible to oppose, say, the Multilateral Agreement on
Investment, if you don't know that it exists. That's the point of the secrecy. You'd have to not only read the headlines which say market economy's triumphed, but you also have to read Alan Greenspan, the head of the Federal Reserve, when he's talking internally; when he says, look, the health of the economy depends on a wonderful achievement that we've brought about, namely "worker insecurity." That's his term. Worker insecurity is a great boon for the health of the economy because it keeps wages down. It's great: it keeps profits up and wages down. // A corporate watch interview with Noam Chomsky / By Anna Couey and Joshua Karliner, June 1998, http://zena.secureforum.com/Znet/zmag/articles/chomsky june98.htm
We number the sentences from 1 to N (N = 9 in our example). For each sentence we form an array of three values: Sentence Number - Sentence Text - Triggers Code. The length of the triggers code equals the number of presupposition triggers defined for text transformations, i.e. 100 positions. Each position is associated with a particular trigger from our list of 100 and records how many times that trigger occurs in the given sentence. We then perform a position-by-position summation of the trigger counts over all the sentences and obtain CHECK SUM 1 (SUM1); for our example text it looks like:
110100100000100000000000000000000000000000000020000000000000000010000000000000000000000000000100000000
(In general, a text can be of arbitrary length and hence can contain any number of triggers; we cap the value of each code position at 65535.)
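The checksum step above can be sketched as follows. The two-entry trigger index and its positions are invented for the example; the actual scheme fixes 100 positions, one per trigger, and the real system counts trigger occurrences from the parse rather than by pattern matching:

```python
import re

N_TRIGGERS = 100   # size of the trigger inventory
CAP = 65535        # maximum value stored in each code position

def trigger_code(sentence, trigger_index):
    """Per-sentence trigger-count vector (the 'Triggers Code')."""
    code = [0] * N_TRIGGERS
    for trig, pos in trigger_index.items():
        hits = re.findall(r"\b" + re.escape(trig) + r"\b", sentence,
                          re.IGNORECASE)
        code[pos] = min(len(hits), CAP)
    return code

def checksum(sentences, trigger_index):
    """Position-by-position summation over all sentences (SUM1), capped."""
    total = [0] * N_TRIGGERS
    for s in sentences:
        for pos, c in enumerate(trigger_code(s, trigger_index)):
            total[pos] = min(total[pos] + c, CAP)
    return total

# Toy index with invented positions.
idx = {"find out": 0, "the": 1}
sents = ["Well, the first thing they have to do is find out what's happening.",
         "That's the point of the secrecy."]
sum1 = checksum(sents, idx)
print(sum1[:5])   # → [1, 3, 0, 0, 0]
```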
After that we make transformations in the text (no more than one transformation per three sentences, in order to preserve the fluency of the text and to avoid sentences sounding heavy, artificial and hence obviously modified) and get the following modified text:
A: What kinds of things can people do to try to expand and reclaim democracy and the public space from corporations?
B: Well, the first thing they have to do is see [the factive find out is replaced with a synonymous factive verb] what's happening to them. So if you have none of that information, you can't do much. For example, it's impossible to oppose, say, the Multilateral Agreement on Investment, if you don't know that it exists. That's the point of secrecy [the definite article the is removed]. You'd have to not only read the headlines which say market economy's triumphed, but you also have to read Alan Greenspan, the head of the Federal Reserve, when he's talking internally; when he says, look, the health of the economy depends on a wonderful achievement that we've brought about, namely "worker insecurity." That's his term. Worker insecurity is a great boon for the economy health [the definite article the is removed and the sentence fragment is slightly paraphrased] because it keeps wages down. It's great: it keeps profits up and wages down.
After the transformations we form the array of modifications comprising the following pairs:
Numbers of the sentences bearing transformations ::: Position numbers of the transformed triggers
We then recalculate SUM1 and obtain SUM2:
9010010000000001000000000000000000000000000002000000000000000001000000000000000000000000000100000000
Having obtained SUM1 and SUM2, we generate the key in the form of a hash table with the following fields:
SUM1 - SUM2 - Number of transformations in the text - Numbers of the sentences bearing transformations - Position numbers of the transformed triggers
Using this key we can perform the inverse (back) transformation of the text and check that the resulting code coincides with the original SUM1, thus proving that the text is original or, otherwise, that it has been subjected to unauthorized modifications.

3.2. Applications

The watermark can convey a proof of information ownership, an integrity mark, or a fingerprint containing the end-user id. Applications are possible in many domains where fairly long texts need to be watermarked, including press agencies, authenticated texts, and e-book protection. Our method can also be used for very compact storage of text for retrieval. The method is domain-independent: we can process texts of any genre and any field, because presuppositions are present in any discourse regardless of its subject. Another advantage of the method is that it is potentially resilient to attacks such as translation into another language. Presuppositions, being a semantic phenomenon, exist in every human language; only the triggers generating them (particular words and syntactic constructions) vary from language to language. Defining trigger lists for other languages is a subject of future research; the method itself will then carry over directly.

4. DISCUSSION AND FUTURE WORK

We have described a method of text watermarking using presuppositions.
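As a summary of the procedure in Section 3.1, the key generation and verification steps might be sketched as follows. The dictionary field names are illustrative; the key holds SUM1, SUM2, the number of transformations, and one (sentence number, trigger position) pair per transformation:

```python
# Sketch of the key structure and the verification check; field names are
# illustrative, and the short 4-position vectors stand in for the
# 100-position checksums.

def make_key(sum1, sum2, modifications):
    return {
        "SUM1": sum1,
        "SUM2": sum2,
        "n_transformations": len(modifications),
        "modifications": modifications,  # [(sentence no., trigger pos.), ...]
    }

def verify(recomputed_sum1, key):
    """After the inverse transformation, the recomputed checksum must equal
    the stored SUM1; a mismatch signals unauthorized modification."""
    return recomputed_sum1 == key["SUM1"]

key = make_key([1, 1, 0, 1], [9, 0, 1, 1], [(2, 0), (4, 1), (7, 2)])
print(verify([1, 1, 0, 1], key))   # True  -> text is authentic
print(verify([1, 2, 0, 1], key))   # False -> unauthorized modification
```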
We have shown how the semantic representations of sentences withstand surface transformations based on presupposition triggers. For the purposes of text watermarking we have so far studied the behavior of eight types of presupposition triggers in discourse. Some of them are well formalized and can easily be used by an automatic algorithm. However, a few triggers are hard to formalize, and at present we perform the transformations based on them manually (while the watermarking algorithm itself is fully automatic in all cases). Thus one direction of future research on this topic will be further work on formalizing the transformation possibilities for these presupposition triggers. We are also studying other methods of text watermarking based on presuppositions, since this phenomenon offers several possibilities. Even single sentences, taken in isolation, can carry the proposed watermarks. In those cases the presupposed information will be treated as new information: it will not be bound to information in other sentences, but all the mentioned transformations still hold. However, the longer the text to be watermarked, the more efficient and resilient the watermark can be made. If we create a web of bound information (resolved presupposed information bound to its antecedents in the preceding text), it will hold the integrity of the text, introducing a secret ordering into the text structure and making it resilient to "data loss" and "data altering" attacks - changing the order of sentences, removing sentences from the text, or modifying them. Our own software for automatic text watermarking based on this method is under development.

5. REFERENCES

[1] J. Bos, "Towards Wide-Coverage Semantic Interpretation", Proc. of the Sixth International Workshop on Computational Semantics (IWCS-6), 2005.
[2] S. Clark, J.R. Curran, "Parsing the WSJ using CCG and Log-Linear Models", Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 2004.
[3] H. Kamp, "The Importance of Presupposition", The Computation of Presuppositions and their Justification in Discourse, ESSLLI'01, 2001.
[4] M. Topkara, G. Riccardi, D. Hakkani-Tur, M.J. Atallah, "Natural Language Watermarking: Challenges in Building a Practical System", Proc. of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, San Jose, CA, January 15-19, 2006.
[5] R. van der Sandt, "Presupposition Projection as Anaphora Resolution", Journal of Semantics, 9, 1992.
[6] O. Vybornova, B. Macq, "A Method of Text Watermarking Using Presuppositions", Proc. of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents (SPIE'07), San Jose, CA, January 29 - February 1, 2007.
[7] O. Vybornova, B. Macq, "Natural Language Watermarking and Robust Hashing Based on Presuppositional Analysis", Proc. of the 2007 International Conference on Information Reuse and Integration (IEEE IRI-07), Las Vegas, USA, August 13-15.
[8] O. Vybornova, "Presuppositional Component of Communication and Its Applied Modeling" (original title: Пресуппозиционный компонент общения и его прикладное моделирование), PhD dissertation, Moscow State Linguistic University, 2002.