INDONESIAN DOCUMENT SEGMENTATION USING TEXT TILING
DOI: https://doi.org/10.31000/jika.v5i3.5037
Abstract
Text tiling splits a long document into multi-paragraph segments, each covering a related subtopic. In this study, the input documents are stripped of their reading format before segmentation. The text tiling method has three stages: tokenisation, similarity determination, and boundary identification. The segmentation produced by the text tiling algorithm has not yet fully met the objective. The result is strongly influenced by the common-word file, the number of tokens in a token-sequence, and the number of token-sequences within a block. The algorithm is also very sensitive to how words are written and to the reading format, such as titles and subtitles, so the reading format must be removed to leave only the body of the text. Segmentation results improved over the course of the trials. In an experiment on 15 reading texts, the segmentation reached a precision of 59.3% and a recall of 80%. These trials used 4140 common words, a total similarity coefficient score of 5, 20 tokens per token-sequence, and 3 token-sequences per block.
Keywords: text tiling, segmentation, multiparagraph segmentation
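The three stages named in the abstract (tokenisation, similarity determination, boundary identification) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tiny `COMMON_WORDS` set, the cosine similarity measure, the restriction of depth scores to valleys, and the cutoff of mean minus half the standard deviation are all assumptions made for illustration, and every function name is hypothetical. Only the defaults `w=20` (tokens per token-sequence) and `k=3` (token-sequences per block) come from the abstract.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the study's file of 4140 common words.
COMMON_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def tokenize(text, common_words=COMMON_WORDS):
    """Stage 1: lowercase, keep alphabetic tokens, drop common words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in common_words]


def token_sequences(tokens, w=20):
    """Group tokens into consecutive token-sequences of w tokens each."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(seqs, k=3):
    """Stage 2: compare the block of up to k sequences before each gap with the block after it."""
    scores = []
    for gap in range(1, len(seqs)):
        left = Counter(t for s in seqs[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in seqs[gap:gap + k] for t in s)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each valley in the similarity curve; gaps that are not valleys get 0."""
    depths = []
    last = len(scores) - 1
    for i, s in enumerate(scores):
        is_valley = (i == 0 or scores[i - 1] >= s) and (i == last or scores[i + 1] >= s)
        if not is_valley:
            depths.append(0.0)
            continue
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1  # climb to the nearest peak on the left
        right = i
        while right < last and scores[right + 1] >= scores[right]:
            right += 1  # climb to the nearest peak on the right
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def find_boundaries(depths):
    """Stage 3: keep gaps whose depth is positive and above mean - std/2 (assumed cutoff)."""
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2
    # Gap i lies between sequence i and i + 1, so the new segment starts at i + 1.
    return [i + 1 for i, d in enumerate(depths) if d > 0 and d > cutoff]


def text_tiling(text, w=20, k=3):
    """Return the indices of the token-sequences that start a new segment."""
    seqs = token_sequences(tokenize(text), w)
    return find_boundaries(depth_scores(gap_scores(seqs, k)))


# Toy input: 60 tokens about farming followed by 60 tokens about finance.
farming = "padi sawah petani panen pupuk "
finance = "bank kredit bunga saham pasar "
print(text_tiling(farming * 12 + finance * 12))
```

On the toy input, the two vocabularies do not overlap, so the similarity curve has a single valley at the gap between the third and fourth token-sequence. Real Indonesian text would need the full common-word list and, as the abstract notes, removal of titles and subtitles before tokenisation.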