Corpus

The corpus contains over 70 000 tweets written in Spanish by nearly 200 well-known personalities and celebrities from the worlds of politics, economics, communication, mass media and culture, between November 2011 and March 2012. Although the extraction context has a Spain-focused bias, the diverse nationalities of the authors, who come from Spain, Mexico, Colombia, Puerto Rico, the USA and many other countries, give the corpus broad coverage of the Spanish-speaking world.

Each Twitter message includes its ID (twitid), its creation date (date) and the user ID (user).

Due to restrictions in the Twitter API Terms of Service, a corpus that includes text contents or information about users may not be redistributed. Redistribution is allowed, however, if those fields are removed and only IDs (Tweet IDs and user IDs) are provided. The actual message content can easily be obtained by querying the Twitter API with the twitid. In addition, the user ID can be used to retrieve the user name, registration date, geographical location and many other fields, which may make it possible to run experiments on, for instance, the different varieties of Spanish.

Each message is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus one additional no-sentiment tag (NONE).
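As a minimal sketch of how the tag set above might be handled in code, the following Python snippet validates a tag and maps it to an ordinal score. The numeric scores and the function name polarity_score are illustrative assumptions; the corpus itself only defines the six tags.

```python
from typing import Optional

# The six tags defined for the corpus.
POLARITY_TAGS = ("P+", "P", "NEU", "N", "N+", "NONE")

# Hypothetical ordinal scores (an assumption, not part of the corpus).
POLARITY_SCORE = {"P+": 2, "P": 1, "NEU": 0, "N": -1, "N+": -2}


def polarity_score(tag: str) -> Optional[int]:
    """Return an ordinal score for a polarity tag, or None for NONE."""
    if tag not in POLARITY_TAGS:
        raise ValueError(f"unknown polarity tag: {tag!r}")
    # NONE has no entry in POLARITY_SCORE, so .get() yields None for it.
    return POLARITY_SCORE.get(tag)
```

Keeping NONE outside the numeric scale reflects the distinction drawn above between a neutral sentiment and no sentiment at all.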

Moreover, where applicable, the same polarity tagging is also applied to each entity mentioned in the text.

Each polarity tag also indicates whether the sentiment expressed within the content is in agreement or disagreement. This is especially useful for distinguishing a neutral sentiment that comes from neutral keywords from one that arises because the text contains positive and negative sentiments at the same time.

In addition, a set of topics has been selected based on the thematic areas covered by the corpus, such as "política" ("politics"), "fútbol" ("soccer"), "literatura" ("literature") or "entretenimiento" ("entertainment"). Each message in the corpus has been semi-automatically assigned to one or several of these topics (most messages are associated with just one topic, due to the short length of the text).

This tagged corpus has been divided into two sets: training and test. The training set will be released along with the corresponding tags so that participants may train and validate their models for classification and sentiment analysis. The test corpus will be provided without any tag and will be used to evaluate the results provided by the different systems.

The full corpus can be downloaded as indicated below.

Data

The corpus is written in XML as defined by the following twits.xsd schema (Figure 1), in which the text of the content element has been removed to comply with the Twitter restrictions.


Figure 1: Twits XML Schema

The following figure shows the information for two sample twits. The second one is tagged with both the global polarity of the message and the polarity associated with each of the entities that appear in the text (UPyD and Foro Asturias), whereas the first twit is tagged only with the global polarity because the text mentions no entities.

		<twit>
			<twitid>0000000000</twitid>
			<user>usuario0</user>
			<content><![CDATA['Conozco a alguien q es adicto al drama! Ja ja ja te suena d algo!]]></content>
			<date>2011-12-02T02:59:03</date>
			<lang>es</lang>
			<sentiments>
				<polarity><value>P+</value><type>AGREEMENT</type></polarity>
			</sentiments>
			<topics>
				<topic>entretenimiento</topic>
			</topics>
		</twit>

		<twit>
			<twitid>0000000001</twitid>
			<user>usuario1</user>
			<content><![CDATA['UPyD contará casi seguro con grupo gracias al Foro Asturias.]]></content>
			<date>2011-12-02T00:21:01</date>
			<lang>es</lang>
			<sentiments>
				<polarity><value>P</value><type>AGREEMENT</type></polarity>
				<polarity><entity>UPyD</entity><value>P</value><type>AGREEMENT</type></polarity>
				<polarity><entity>Foro_Asturias</entity><value>P</value><type>AGREEMENT</type></polarity>
			</sentiments>
			<topics>
				<topic>política</topic>
			</topics>
		</twit>				
	

The twits_sample.xml file contains a subset of the corpus with around 30 tagged twits.
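Records in the format shown above can be read with Python's standard xml.etree.ElementTree module. The sketch below is illustrative only: the root element name twits and the helper parse_twit are assumptions (the schema figure is not reproduced here), and the content element is omitted, as it is in the distributed corpus.

```python
import xml.etree.ElementTree as ET

# Inline copy of the second sample twit; the <twits> root is an assumption.
SAMPLE = """<twits>
  <twit>
    <twitid>0000000001</twitid>
    <user>usuario1</user>
    <date>2011-12-02T00:21:01</date>
    <lang>es</lang>
    <sentiments>
      <polarity><value>P</value><type>AGREEMENT</type></polarity>
      <polarity><entity>UPyD</entity><value>P</value><type>AGREEMENT</type></polarity>
      <polarity><entity>Foro_Asturias</entity><value>P</value><type>AGREEMENT</type></polarity>
    </sentiments>
    <topics>
      <topic>política</topic>
    </topics>
  </twit>
</twits>"""


def parse_twit(twit):
    """Convert one <twit> element into a plain dictionary."""
    record = {
        "twitid": twit.findtext("twitid"),
        "user": twit.findtext("user"),
        "date": twit.findtext("date"),
        "global_polarity": None,
        "entity_polarity": {},
        "topics": [t.text for t in twit.findall("topics/topic")],
    }
    for pol in twit.findall("sentiments/polarity"):
        entity = pol.findtext("entity")
        tag = (pol.findtext("value"), pol.findtext("type"))
        if entity is None:
            # A <polarity> without <entity> is the global polarity.
            record["global_polarity"] = tag
        else:
            record["entity_polarity"][entity] = tag
    return record


root = ET.fromstring(SAMPLE)
# iter("twit") finds the records regardless of the root element's name.
records = [parse_twit(t) for t in root.iter("twit")]
print(records[0]["global_polarity"])
```

To read a file such as twits_sample.xml instead of the inline string, ET.parse(path).getroot() can replace the ET.fromstring call.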

Note regarding the gold standard

The gold standard (or qrels in the TREC context) has been generated by first pooling the submissions from all participants, then applying a voting scheme, and finally carrying out an extensive human review of the ambiguous decisions (thousands of them). Due to the high volume of data, this process is unfortunately subject to errors and misclassifications.
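The pooling and voting steps described above can be pictured with a simple majority vote. This is only a sketch under stated assumptions: the actual voting scheme is not specified here, and the function name vote, the min_share threshold and the rule for flagging ambiguous cases are all hypothetical.

```python
from collections import Counter


def vote(labels, min_share=0.5):
    """Return (winning_label, needs_review) for one tweet's pooled labels.

    A decision is flagged for human review when the top label is tied
    or does not exceed min_share of the votes (a hypothetical rule).
    """
    if not labels:
        raise ValueError("no labels to vote on")
    counts = Counter(labels)
    (top, top_n), *rest = counts.most_common()
    tied = bool(rest) and rest[0][1] == top_n
    ambiguous = tied or top_n / len(labels) <= min_share
    return top, ambiguous


# Hypothetical pooled submissions: tweet ID -> labels from each system.
pooled = {
    "0000000000": ["P+", "P+", "P"],
    "0000000001": ["P", "N", "NEU", "P"],
}
for twitid, labels in pooled.items():
    print(twitid, vote(labels))
```

In this sketch the first tweet gets a clear majority, while the second is flagged as ambiguous and would go to human review.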

Please tell us about any problem that you detect or any correction that you make on the corpus or the gold standard, and we will make it available to the community.

Downloads

Public files



Password protected area


Request a password

The corpus is freely available to the community. Please send us an email stating your email address, affiliation (institution, company or any other kind of organization) and a brief description of your research objectives, and you will be given a password to download the files in the password-protected area.

Citing TASS

If you use the corpus in your research, we would be very grateful if you would cite the paper and/or the website:

Villena-Román, Julio, Lana-Serrano, Sara, Martínez-Cámara, Eugenio, González-Cristobal, José Carlos. 2013. Revista de Procesamiento del Lenguaje Natural, 50, pp 37-44. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4657.

TASS - SEPLN (Taller de Análisis de Sentimientos en la SEPLN) website. http://www.daedalus.es/TASS.