<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NLP &#8211; Megaputer Intelligence</title>
	<atom:link href="https://www.megaputer.com/ko/tag/nlp/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.megaputer.com/ko</link>
	<description>지식 파트너</description>
	<lastBuildDate>Tue, 24 Mar 2026 00:02:52 +0000</lastBuildDate>
	<language>ko-KR</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.0.22</generator>

<image>
	<url>https://www.megaputer.com/wp-content/uploads/favicon.png</url>
	<title>NLP &#8211; Megaputer Intelligence</title>
	<link>https://www.megaputer.com/ko</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>What is Linguistic Corpus?</title>
		<link>https://www.megaputer.com/ko/what-is-linguistic-corpus/</link>
		<pubDate>Tue, 05 Dec 2023 20:55:10 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35025</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/ko/what-is-linguistic-corpus/">What is Linguistic Corpus?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com/ko">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Natural Language Processing depends on large amounts of language data. Artificial language systems work well only when they have access to plenty of text or speech that faithfully represents natural language. Many kinds of text, from Wikipedia articles to Twitter threads, are used in building language models. In linguistics, such a dataset is called a corpus: a large, principled collection of authentic, naturally occurring language.</span></p>

		</div>
	</div>
<div class="w-separator size_small"></div><div class="w-image align_center"><div class="w-image-h"><img width="564" height="374" src="https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27.jpg" class="attachment-full size-full" alt="Random collection of different words and word-forms on magnetic tiles" srcset="https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27.jpg 564w, https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27-300x199.jpg 300w" sizes="(max-width: 564px) 100vw, 564px" /></div></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Although corpora have drawn more attention since the advent of LLMs, the concept is not new in linguistics; it dates back to the early 20th century or earlier. At that time, lexicographers manually compiled large samples of language for dictionary construction, a rudimentary form of corpus. Other linguists also compiled corpora by hand for purposes such as language pedagogy and the study of language acquisition in children. The so-called &#8220;structural&#8221; linguists, who dominated the field until the 1950s, valued corpora because they held that language should be analyzed through the observed distribution of its forms. In this spirit, linguist Charles C. Fries, for example, manually gathered telephone conversations from approximately 300 speakers, totaling 250,000 words. Field linguist Franz Boas dedicated himself to building a substantial corpus of texts from Native Americans.</span></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">The first electronic corpus appeared in the 1960s. W. Nelson Francis and Henry Kučera of Brown University compiled a corpus of 1 million words of American English, known as the Brown Corpus. Since then, linguistic corpora have grown and diversified rapidly in size and scope, driven by advances in computer hardware. For instance, </span><a href="https://www.english-corpora.org/coca/"><span style="font-weight: 400;">the Corpus of Contemporary American English (COCA)</span></a><span style="font-weight: 400;">, currently </span><span style="font-weight: 400;">one of the most widely used American English corpora, contains over one billion words drawn from eight genres.</span></p>

		</div>
	</div>
<div class="w-separator size_medium"></div><div class="w-image align_center"><div class="w-image-h"><img width="600" height="451" src="https://www.megaputer.com/wp-content/uploads/istock-1439146187-600x451.jpg" class="attachment-shop_single size-shop_single" alt="Abstract source code background, Big data database app, Computer code, Lines of HTML code by programmer, active online data transformation and exchange" srcset="https://www.megaputer.com/wp-content/uploads/istock-1439146187-600x451.jpg 600w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-300x225.jpg 300w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-1024x769.jpg 1024w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-768x577.jpg 768w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-533x400.jpg 533w" sizes="(max-width: 600px) 100vw, 600px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><b>Types of Corpora and Their Use</b></h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Corpora can be broadly categorized into two types: generalized and specialized. A generalized corpus aims to represent all aspects of a language and typically consists of a large number of words and/or documents. Examples of generalized corpora include the </span><a href="http://www.natcorp.ox.ac.uk/"><span style="font-weight: 400;">British National Corpus</span></a><span style="font-weight: 400;"> and </span><a href="https://www.english-corpora.org/coca/"><span style="font-weight: 400;">COCA</span></a><span style="font-weight: 400;">, mentioned earlier. On the other hand, a specialized corpus is constructed to represent specific aspects of a language, such as language produced by particular user groups (e.g., children, teenagers, second language learners, aphasics) or in specific settings (e.g., university, business). The </span><a href="https://childes.talkbank.org/access/"><span style="font-weight: 400;">Child Language Data Exchange System (CHILDES)</span></a><span style="font-weight: 400;">, a large corpus of language produced by developing children, and the </span><a href="https://scnweb.japanknowledge.com/register/PERC/index.html"><span style="font-weight: 400;">Professional English Research Consortium (PERC)</span></a><span style="font-weight: 400;"> corpus, which contains English academic journal texts in science, engineering, technology, and other fields, are examples of specialized corpora.</span></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><img style="float: right;" src="https://www.megaputer.com/wp-content/uploads/frame-2.png" alt="Parts of Speech" width="500" /><span style="font-weight: 400;">Corpora are frequently enriched with annotations containing linguistic information. Typically, the texts in a corpus undergo Part of Speech (POS) tagging, in which the grammatical category of each word is identified. Beyond POS tags, corpora can carry annotations for syntactic structure, semantic roles (the role each noun argument plays in a sentence, such as agent, patient, or instrument), named entities, sentiment, and more. To illustrate, the </span><a href="https://catalog.ldc.upenn.edu/LDC99T42"><span style="font-weight: 400;">Penn Treebank</span></a><span style="font-weight: 400;"> project annotates a collection of Wall Street Journal articles with both POS tags and syntactic phrase-structure (constituency) trees. Another syntactically annotated resource is the </span><a href="http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html"><span style="font-weight: 400;">Google Books Syntactic N-grams dataset</span></a><span style="font-weight: 400;">, comprising syntactic tree fragments automatically extracted from the Google English Books collection. Some corpora are annotated with semantic information. The </span><a href="https://propbank.github.io/"><span style="font-weight: 400;">Proposition Bank</span></a><span style="font-weight: 400;">, for example, provides a collection of English sentences that are manually annotated with semantic roles. Some corpora also mark named entities. For example, </span><a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S3"><span style="font-weight: 400;">GENETAG</span></a><span style="font-weight: 400;"> is a corpus of 20,000 sentences from the MEDLINE database annotated with gene and protein names.</span></p>
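			<p><span style="font-weight: 400;">As a rough illustration of what a POS-annotated text can look like in practice, the sketch below reads one sentence written in a simple word/TAG notation and counts the tags. The sentence, the notation, and the helper names <code>parse_tagged</code> and <code>tag_counts</code> are assumptions made for this example; the tags follow the Penn Treebank style, but this is not the distribution format of any particular corpus.</span></p>

```python
from collections import Counter

# Hypothetical toy sentence in "word/TAG" form with Penn Treebank-style tags.
TAGGED = "The/DT corpus/NN contains/VBZ annotated/VBN sentences/NNS ./."


def parse_tagged(line):
    """Split a 'word/TAG' line into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash only, so a word that itself
        # contains a slash keeps its full form.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_counts(pairs):
    """Count how often each POS tag occurs in the parsed pairs."""
    return Counter(tag for _, tag in pairs)


tokens = parse_tagged(TAGGED)
counts = tag_counts(tokens)

for word, tag in tokens:
    print(f"{word}\t{tag}")
```

			<p><span style="font-weight: 400;">Keeping each word paired with its tag is what makes later queries possible, such as listing every noun in a text or comparing tag distributions across genres.</span></p>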

		</div>
	</div>
</div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><b>Use of Corpora in NLP</b></h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Corpora have traditionally served as tools for investigating specific aspects of language use, including collocation, frequency, preferred constructions, and patterns of errors. With the rise of computer technology and machine learning algorithms, they have also become essential resources for model training and validation. Large Language Models are now trained on vast amounts of language data. For instance, OpenAI&#8217;s GPT-3 175B model was trained on 300 billion tokens gathered from sources such as Common Crawl, WebText2, and Wikipedia. Meta&#8217;s LLaMA 65B model, by comparison, was trained on 1.4 trillion tokens drawn from sources including English CommonCrawl, C4, GitHub, Wikipedia, ArXiv, and Stack Exchange. In our PolyAnalyst software, corpora such as the </span><a href="https://anc.org/"><span style="font-weight: 400;">Open American National Corpus</span></a><span style="font-weight: 400;">, the </span><a href="https://catalog.ldc.upenn.edu/LDC99T42"><span style="font-weight: 400;">Penn Treebank</span></a><span style="font-weight: 400;">, CoNLL Corpus, </span><a href="https://opus.nlpl.eu/Europarl-v3.php"><span style="font-weight: 400;">Europarl</span></a><span style="font-weight: 400;">, the </span><a href="https://ruscorpora.ru/en/"><span style="font-weight: 400;">Russian National Corpus</span></a><span style="font-weight: 400;">, and Wikipedia are used in various modules. The use of large language datasets is likely to keep expanding, which makes skill in handling such data crucial in the field of Natural Language Processing.</span></p>
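			<p><span style="font-weight: 400;">The traditional uses mentioned above, frequency and collocation, can be sketched in a few lines. The example below counts word frequencies and adjacent word pairs in a three-sentence toy corpus. The sentences, the simple regular-expression tokenizer, and the function names are assumptions made for illustration; real collocation studies use far larger corpora and usually statistical association measures rather than raw pair counts.</span></p>

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real study would use far more text.
CORPUS = [
    "strong tea is better than weak tea",
    "she drinks strong tea every morning",
    "he prefers strong coffee",
]


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(sentences):
    """Count how often each word form occurs across all sentences."""
    freq = Counter()
    for sentence in sentences:
        freq.update(tokenize(sentence))
    return freq


def bigram_counts(sentences):
    """Count adjacent word pairs, never pairing words across sentences."""
    pairs = Counter()
    for sentence in sentences:
        toks = tokenize(sentence)
        pairs.update(zip(toks, toks[1:]))
    return pairs


freq = word_frequencies(CORPUS)
bigrams = bigram_counts(CORPUS)

print(freq.most_common(2))
print(bigrams.most_common(1))
```

			<p><span style="font-weight: 400;">Raw adjacent-pair counts are only a first approximation of collocation, since very frequent words pair with almost everything; they are used here to keep the sketch short.</span></p>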

		</div>
	</div>
</div></div></div></div></div></section>
]]></content:encoded>
			</item>
	</channel>
</rss>
