<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Megaputer Intelligence</title>
	<atom:link href="https://www.megaputer.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.megaputer.com</link>
	<description>Your Knowledge Partner</description>
	<lastBuildDate>Tue, 24 Mar 2026 00:02:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.0.22</generator>

<image>
	<url>https://www.megaputer.com/wp-content/uploads/favicon.png</url>
	<title>Megaputer Intelligence</title>
	<link>https://www.megaputer.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Can AI Understand Your Customers&#8217; Needs?</title>
		<link>https://www.megaputer.com/deep-listen-post/</link>
		<pubDate>Tue, 24 Feb 2026 20:54:16 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35564</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/deep-listen-post/">Can AI Understand Your Customers&#8217; Needs?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Right now, your customer feedback is telling you exactly what&#8217;s wrong – and you&#8217;ll never read it.</p>
<p>Not because you don&#8217;t care – but because the feedback is so diverse and its volume so overwhelming. Thousands of reviews, survey responses, support tickets, employee comments, interview transcripts, and open-ended form fields pour in constantly. There&#8217;s simply no practical way for a human team to read through all of it, find the key patterns, and turn those patterns into action. So most of that data sits untouched, a goldmine that never gets mined.</p>
<p>It seems like AI should be able to solve this problem. After all, AI can read all of your data. But if you&#8217;ve tried feeding your data into an AI system and asking for answers, you probably didn&#8217;t get what you wanted. An ordinary AI will just hallucinate, and a RAG system can’t find patterns that only emerge over thousands of records.</p>
<p>What you need is not a smarter prompt, a bigger model, or a more sophisticated RAG system. You need a system that was designed from the ground up to listen for patterns over vast swathes of data. You need a system that reads all your documents to identify your problems, determines which problems are driving the selected KPIs, and writes a report on each one &#8211; with citations from the data to verify that it isn&#8217;t hallucinating.</p>
<p>This is exactly what we have.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><strong>What Is <em>Deep.Listen</em>?</strong></h2>
<p><em><strong>Deep.Listen</strong></em>™ is a specialized system for feedback analysis powered by AI agents that work together to do what no single AI model can do – read all of your data, find the top issues driving your selected KPIs, and write a clear report on each of them. These AI insight reports provide actionable recommendations for fixing the most important problems.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="1024" height="683" src="https://www.megaputer.com/wp-content/uploads/deep-listen-logo-1024x683.png" class="attachment-large size-large" alt="" srcset="https://www.megaputer.com/wp-content/uploads/deep-listen-logo-1024x683.png 1024w, https://www.megaputer.com/wp-content/uploads/deep-listen-logo-300x200.png 300w, https://www.megaputer.com/wp-content/uploads/deep-listen-logo-600x400.png 600w, https://www.megaputer.com/wp-content/uploads/deep-listen-logo-768x512.png 768w, https://www.megaputer.com/wp-content/uploads/deep-listen-logo.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>From your perspective, the process is straightforward:</p>
<ol>
<li><strong>You provide the data:</strong> Send us your customer reviews, stakeholder feedback, or other documents in any standard format.</li>
<li><strong>We run the analysis:</strong> Our system analyzes your data, typically completing within the same day, regardless of volume. The system automatically identifies the key issues that are driving your chosen KPIs and then generates a written report on each of them.</li>
<li><strong>Human in the loop:</strong> After our human analysts have reviewed the results, we deliver an AI-generated and analyst-audited report on each of the most pressing issues that are driving your KPIs. This process may take a few days.</li>
</ol>
<p>You give us data. We give you answers. It’s as easy as that.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><strong>A Real-World Example: Finding Hidden Liabilities</strong></h2>
<p>In a recent engagement, a customer had tens of thousands of reviews that were sitting unread on their website. We ran these reviews through Deep.Listen, specifying the star rating as the target KPI. Within hours, the system surfaced something the team hadn&#8217;t flagged: multiple systematic labeling issues were causing customers to feel deceived about their purchases of certain juice products.</p>
<p>The business impact was significant. Customers affected by the confusion rated their purchases an average of 2.2 stars, compared to a 4.6-star average across the full dataset. Reviews were littered with words like &#8220;deceptive&#8221; and &#8220;misleading,&#8221; and some customers complained of health issues.</p>
<p>The problem was complex.</p>
<p><em>Deep.Listen</em> identified dozens of product pages in which a “New Look” image overlaid the product’s hero image in a way that obstructed essential regulatory disclosures. In some cases, customers believed that they were buying 100% juice because the “with added ingredients” disclosure was obstructed by the “New Look” image overlay. Beyond lost trust and repeat purchases, this labeling issue created significant legal liability.</p>
<p><em>Deep.Listen</em> also discovered a labeling issue that was causing some customers to believe they were purchasing pure cranberry juice when, in fact, they were purchasing a blend. Some customers reported allergies to other fruits in the blend, while others reported that they had intended to buy pure cranberry juice to treat a UTI, which was not effectively treated by this cranberry-flavored blend. Both of these issues put the customers’ health at risk, degraded customer satisfaction, and generated legal liability for the company.</p>
<p>We did not direct <em>Deep.Listen</em> to look for liability risks or to examine product labeling – it discovered these issues on its own. In its drive to maximize the target KPI, it automatically found the problems that mattered most.</p>
<p>These hidden liability risks were just one of several issues that <em>Deep.Listen</em> discovered for the company. These issues were hiding in plain sight across thousands of reviews. It took the AI hours to find them. It could have taken a human team months – if they&#8217;d ever found them at all.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><strong>How It Works</strong></h2>
<p>Under the hood, <em>Deep.Listen</em> is a four-agent system.</p>
<p>First, a pair of cooperating agents reads your documents and organizes them into topics and subtopics. These agents are searching for any issues that could potentially drive your KPIs, not just what you told them to look for. Then a third agent calculates each topic’s significance using machine learning to determine which issues are actually driving your target KPIs.</p>
<p>Finally, a fourth agent investigates each significant topic – reading every single piece of feedback associated with it – and writes a plain-language report that explains the issue and makes specific recommendations for improvement. Every report is reviewed by a human analyst for errors, and is grounded in citations from your data, so you can verify that what you&#8217;re reading is real – not a hallucination. Sometimes these reports are produced through multiple rounds of clarifying conversation between the AI and the human in the loop.</p>
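The significance step described above can be illustrated with a toy sketch. This is not Deep.Listen&#8217;s actual algorithm (which is not public); it shows one simple way to rank topics by their effect on a numeric KPI such as a star rating, and all function names and data here are hypothetical.

```python
# Hypothetical sketch: rank feedback topics by their impact on a numeric KPI.
# Each topic's contribution is its mean-KPI deviation from the overall mean,
# weighted by how often the topic occurs. This stands in for the ML-based
# significance analysis described in the text; the real method is not public.
from collections import defaultdict

def rank_topics_by_kpi_impact(records):
    """records: list of (topics, kpi_value) pairs, where topics is the set
    of topic labels assigned to one piece of feedback."""
    overall = sum(kpi for _, kpi in records) / len(records)
    totals, counts = defaultdict(float), defaultdict(int)
    for topics, kpi in records:
        for t in topics:
            totals[t] += kpi
            counts[t] += 1
    impact = {
        t: (totals[t] / counts[t] - overall) * counts[t] / len(records)
        for t in totals
    }
    # Most negative impact first: the issues dragging the KPI down hardest.
    return sorted(impact.items(), key=lambda kv: kv[1])

# Toy data: topic sets with star ratings.
records = [
    ({"labeling", "taste"}, 1.0),
    ({"labeling"}, 2.0),
    ({"taste"}, 5.0),
    ({"shipping"}, 4.0),
    ({"taste"}, 5.0),
]
for topic, score in rank_topics_by_kpi_impact(records):
    print(topic, round(score, 2))
```

On this toy data, "labeling" surfaces as the dominant negative driver even though it appears in only two records, because those records sit far below the overall average.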
<p>The result is a report with actionable recommendations to improve your KPIs.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><strong>The Applications Are Broad</strong></h2>
<p>Any organization that collects unstructured text data can benefit from <em>Deep.Listen</em>. A few examples include:</p>
<ul>
<li><strong>Retailers and product companies:</strong> analyzing customer reviews</li>
<li><strong>Healthcare organizations:</strong> mining patient feedback for care gaps and satisfaction drivers</li>
<li><strong>HR:</strong> reviewing employee engagement surveys or exit interviews</li>
<li><strong>Government and NGOs:</strong> analyzing public consultation responses or stakeholder feedback</li>
<li><strong>Market researchers:</strong> reading open-ended survey responses at scale</li>
<li><strong>Financial services firms:</strong> reviewing client feedback, complaints, or advisor notes</li>
</ul>
<p>If you are receiving customer feedback, <em>Deep.Listen</em> can help you organize it and prioritize what really matters.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><strong>Ready to See What&#8217;s Hiding in Your Data?</strong></h2>
<p><em>Deep.Listen</em> can be applied to any data source, any format, and any volume of text. You don&#8217;t need to know what you&#8217;re looking for in advance – you just need data and a target KPI. We will do the rest and provide you with actionable insights and recommendations.</p>
<p>Contact <strong>Megaputer Intelligence</strong> today to discuss your data and needs and receive a custom proposal.</p>
<p>info@megaputer.com</p>
<p><em>Quotes available upon request.</em></p>
<p>&nbsp;</p>
<p style="text-align: center;">View the webinar recording: <strong><em>Deep.Listen &#8211; Actionable Insights from Customer Feedback</em></strong></p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-video ratio_16x9 align_center" style="max-width:768px;"><div class="w-video-h"><iframe src="//www.youtube.com/embed/X8J_YJ5eq18?rel=0" allowfullscreen="1"></iframe></div></div></div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/deep-listen-post/">Can AI Understand Your Customers&#8217; Needs?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Webinar: Accelerating Drug Discovery with AI Agents</title>
		<link>https://www.megaputer.com/webinar-accelerating-drug-discovery-with-ai/</link>
		<pubDate>Wed, 21 Jan 2026 18:08:17 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35511</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/webinar-accelerating-drug-discovery-with-ai/">Webinar: Accelerating Drug Discovery with AI Agents</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-video ratio_16x9 align_center" style="max-width:768px;"><div class="w-video-h"><iframe src="//www.youtube.com/embed/ALsIlZMtetk?rel=0" allowfullscreen="1"></iframe></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>AI is transforming early-stage drug discovery by accelerating key tasks such as target identification, molecular design, virtual screening, ADMET prediction, and drug repurposing. In this webinar, we’ll focus on one of the most critical steps: <strong>using AI agents to speed up target identification</strong>.</p>
<p>By combining scientific literature with experimental and omics data, AI can help prioritize disease-relevant genes, proteins, and pathways &#8211; enabling faster generation and evaluation of therapeutic hypotheses. This supports both target-based discovery and indication expansion for novel or existing molecules.</p>
<p>Building reliable, pathway-centric knowledge systems isn’t easy. Data volume, quality, bias, and integration remain major challenges.</p>
<p>Topics covered:</p>
<ul>
<li>Prioritizing disease-relevant genes, proteins, and pathways with AI</li>
<li>Creating AI agents without coding</li>
<li>Typical challenges and how to overcome them</li>
<li>Live demo of a practical solution and results</li>
</ul>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/webinar-accelerating-drug-discovery-with-ai/">Webinar: Accelerating Drug Discovery with AI Agents</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>RxLaunch: AI Agents for Pharma Launch Monitoring</title>
		<link>https://www.megaputer.com/rxlaunch-ai-agents-for-pharma-launch-monitoring/</link>
		<pubDate>Tue, 21 Oct 2025 21:44:12 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35482</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/rxlaunch-ai-agents-for-pharma-launch-monitoring/">RxLaunch: AI Agents for Pharma Launch Monitoring</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="300" height="144" src="https://www.megaputer.com/wp-content/uploads/rxlauch_logo-300x144.png" class="attachment-medium size-medium" alt="RxLaunch logo" srcset="https://www.megaputer.com/wp-content/uploads/rxlauch_logo-300x144.png 300w, https://www.megaputer.com/wp-content/uploads/rxlauch_logo-600x289.png 600w, https://www.megaputer.com/wp-content/uploads/rxlauch_logo.png 659w" sizes="(max-width: 300px) 100vw, 300px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h5 style="text-align: center;"><strong>Catch issues before they derail your product launch.</strong></h5>
<h4><strong>The Problem: When Insights Arrive Too Late</strong></h4>
<p>It&#8217;s week three after the new product launch. Early monitoring shows a growing spike of negative sentiment toward your product compared to competing products – though sentiment was well above parity on launch day. Pharmacies are reporting patient concerns. Medical information calls are higher than expected. You have thousands of pieces of feedback but no clear picture of what&#8217;s driving the problem.</p>
<p>Using ordinary tools, your team might take weeks to comb through the textual feedback to reveal those drivers of negative sentiment that weren’t flagged for monitoring before launch day. Three weeks from now, you&#8217;ll be at week six. Thousands of patients will have discontinued. Prescribers will have formed opinions. The launch trajectory will be set.</p>
<p>The speed of identifying and resolving key issues determines launch success – or failure. By the time your analysis delivers answers several weeks later, the critical early period has already passed: you&#8217;re just getting an explanation of what went wrong instead of arming yourself with the insights you need to fix the situation.</p>
<h4><strong>RxLaunch™ Provides Timely Insights</strong></h4>
<p>RxLaunch™ by Megaputer Intelligence is a pharmaceutical launch intelligence platform that uses AI agents to transform launch feedback into actionable insights in just a matter of hours, allowing your team to recognize and respond to key issues as they emerge in real time.</p>
<p>You load in all your feedback: patient comments, prescriber notes, medical information calls, pharmacy reports, support tickets, and other data. RxLaunch&#8217;s AI agents perform semantic analysis of the texts and generate a hierarchy of informative tags that organize the feedback. Next, the system determines which issues are driving the key performance indicators (KPIs) such as stakeholder sentiment, adherence, or patient discontinuation. Notably, RxLaunch does not require you to specify the patterns that you are looking for in advance.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="600" height="308" src="https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-600x308.png" class="attachment-shop_single size-shop_single" alt="" srcset="https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-600x308.png 600w, https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-300x154.png 300w, https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-1024x525.png 1024w, https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-768x394.png 768w, https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1-780x400.png 780w, https://www.megaputer.com/wp-content/uploads/rxlaunch_explained-1.png 1520w" sizes="(max-width: 600px) 100vw, 600px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;"><em>RxLaunch™ ingests launch feedback, discovers emerging issues, and ranks the issues by impact on selected KPIs</em></p>
<p>&nbsp;</p>
<p>You receive a prioritized action list showing exactly which issues to address first. This list shows which issues are actually driving your KPIs, and which issues can be safely ignored. No need to train a neural network. No category definitions. No technical expertise required. The RxLaunch agents do all the work and deliver insights in an easy-to-comprehend form.</p>
<h4><strong>How One Pharma Company Saved Its Launch</strong></h4>
<p><em>Note: The following scenario is a hypothetical example designed to illustrate potential applications and outcomes. Specific metrics and results are for illustrative purposes only.</em></p>
<p>A pharmaceutical company launched a chronic disease medication. Launch day went smoothly. Week one looked promising. But by week three, early data showed a problem: sentiments were slipping compared to competing products. The launch was losing momentum at the worst possible time &#8211; during the critical early weeks when market position gets established.</p>
<p>The company had the feedback: pharmacist notes, medical information calls, patient comments, online discussions &#8211; tens of thousands of text records in all. What they didn&#8217;t have was clarity. But the use of RxLaunch enabled them to analyze the incoming data daily, automatically organizing and prioritizing issues as they emerged.</p>
<p>RxLaunch revealed that the drop in sentiment from patients and providers was being driven by a surprising issue &#8211; “refrigeration”. This was unexpected, since the product did not require refrigeration. Drilling down into the subtopics of the refrigeration category showed “can&#8217;t travel with medication,” “does not fit with lifestyle,” and surprisingly, mentions of a specific injection pen for an unrelated product in a different therapeutic area.</p>
<p>A closer inspection of the records showed that the product’s injection pen bore a superficial resemblance to another widely-used injection pen for a medication that required refrigeration. Even though the injection pen was for a different therapeutic category, and made by a different manufacturer, the superficial similarities in mechanism, appearance, and size caused some patients to draw a parallel. Those patients who&#8217;d used the other injector pen were instinctively storing the new medication the same way &#8211; and convincing others to do the same.</p>
<p>The refrigeration myth was spreading virally. During week one, just 2-3% of patients were refrigerating due to individual confusion. But by week three, that number had risen to 14%. Projections showed that the myth would have spread to a majority of patients by week eight if the trend had continued.</p>
<p>Refrigeration didn&#8217;t harm the drug, but it destroyed adherence. Patients could forget doses when traveling, skip injections when away from home overnight, or just find the cold-storage routine burdensome. Many who believed the myth said that the product was &#8220;too inconvenient,&#8221; and often stopped refilling.</p>
<h4><strong>Rapid Response and Recovery</strong></h4>
<p>Acting on this insight, the company moved immediately. Within a day, the patient support and communications teams began posting clarifications in online communities where misinformation was spreading. In parallel, pharmacy hotlines and patient support staff were briefed to proactively correct the misconception during inbound calls and chat interactions. Within a week, pharmacists received an updated counseling cue card: “Storage: room temperature; do not refrigerate.” Automated text messages on day two and day seven reminded patients to store the product at room temperature.</p>
<p>Within eight weeks, belief in the refrigeration myth had dropped to almost zero. Patient and provider sentiment recovered, regaining full parity with competing products. Six-month refill rates came in at 72%, two points above projection.</p>
<p>Traditional analysis might have taken weeks to find the issue. In those extra weeks, the product’s image would have been irreparably tarnished. While the myth would have eventually been discovered and put down, first impressions among patients and healthcare providers cannot easily be corrected. The product would have been permanently positioned as &#8220;the inconvenient one,&#8221; with low adherence and high discontinuation.</p>
<p>RxLaunch found what no one thought to look for, quantified what might have been dismissed, and delivered it before viral spread made the problem unfixable. In this case, the speed of discovery and response to new insights drew the line between success and failure.</p>
<h4><strong>What Makes RxLaunch Different</strong></h4>
<p>The refrigeration myth was just one of many insights brought to light by RxLaunch. While competing product launch monitoring solutions rely on crude algorithms that struggle to find patterns beyond what they are told to look for, RxLaunch takes a fundamentally different approach. Our AI agents work like expert human analysts &#8211; discovering and organizing issues in real time, making intelligent decisions at every step.</p>
<p>These agents don&#8217;t just search for predefined problems &#8211; they autonomously identify what&#8217;s actually driving your KPIs, even unexpected issues you never thought to monitor. As new patterns emerge in your feedback, the agents adapt and refine their analysis automatically.</p>
<p>The result: comprehensive, intelligent, actionable insights from day one. No blind spots from forgotten categories. No critical patterns buried in the noise of massive unstructured datasets. Just intelligent agents working around the clock to surface the issues that matter most for launch success.</p>
<h4><strong>Who This Is For</strong></h4>
<p>RxLaunch is designed for pharmaceutical teams responsible for product launch success:</p>
<ul>
<li><strong>Launch Directors</strong> who need real-time visibility into emerging issues that could derail commercial performance during the critical early weeks</li>
<li><strong>Brand Teams</strong> who must protect market positioning and respond quickly to unexpected competitive dynamics or patient perception shifts</li>
<li><strong>Medical Affairs</strong> who handle complex medical information inquiries and need to identify systemic clinical or product questions emerging across touchpoints</li>
<li><strong>Patient Support Teams</strong> who manage frontline feedback but lack tools to quickly identify patterns that signal broader adherence or satisfaction issues</li>
<li><strong>Commercial Operations Leaders</strong> who must make fast, data-driven decisions to protect launch investments and adjust tactics in real time</li>
<li><strong>Market Access Teams</strong> who need to understand and address payer, pharmacy, and access barriers before they become widespread adoption blockers</li>
<li><strong>Field Teams</strong> who need actionable intelligence about prescriber concerns and patient feedback to guide their conversations and remove obstacles to uptake</li>
</ul>
<p>If your team is responsible for a product launch and you&#8217;re swimming in feedback data but struggling to identify which issues actually matter, consider RxLaunch. It transforms unstructured data into clear, prioritized action items &#8211; before problems become unfixable.</p>
<p>&nbsp;</p>
<p>Your next launch is coming. The feedback will accumulate. The critical issues will emerge in early data. The only question: will you have insights when the launch opportunity window is still open, or will you only receive a report explaining what went wrong when it&#8217;s too late?</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;">See <a href="https://www.megaputer.com/contact/"><strong>how RxLaunch works</strong></a></p>
<p style="text-align: center;">Read more about the <strong><a href="https://www.megaputer.com/what-is-just-in-time-taxonomy/">AI agents powering RxLaunch</a></strong></p>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/rxlaunch-ai-agents-for-pharma-launch-monitoring/">RxLaunch: AI Agents for Pharma Launch Monitoring</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>What is Just-in-Time Taxonomy?</title>
		<link>https://www.megaputer.com/what-is-just-in-time-taxonomy/</link>
		<pubDate>Tue, 23 Sep 2025 21:35:00 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35454</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/what-is-just-in-time-taxonomy/">What is Just-in-Time Taxonomy?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Today is the big day. Maybe your product just launched, your new feature went live, or the results of the survey are in. Thousands of comments, reviews, support tickets (or whatever other data you can imagine) are flooding your system, and your team desperately needs to make sense of it all. But first, someone has to organize all this feedback into a meaningful form.</p>
<p>Using traditional methods, a team might need 3-4 weeks to taxonomize the data, assuming they can work on the problem full-time. By then, the crucial issues will have festered, opportunities will have passed, and the critical window for rapid response will have already slammed shut. You can also use a pre-built taxonomy, but this will capture only the categories that you were able to anticipate in advance.</p>
<p>An AI-powered Just-in-Time Taxonomy system solves this problem. Instead of spending weeks manually creating categories or trying to force your data into prechosen boxes where it does not really fit, the Just-in-Time Taxonomy system employs a pair of collaborating AI agents to analyze your data and build a custom taxonomy in under an hour. This gives you a taxonomy built <em>from your own data</em> on day one.</p>
<p>But how does this work? And when does it make sense to use this approach instead of traditional methods?</p>
<h2>How it works: A two-agent system</h2>
<p>The Just-in-Time Taxonomy system uses two specialized AI agents that work in tandem to analyze your data.</p>
<p>The <strong>Category Building Agent</strong> (aka Category Builder) is responsible for discovering what categories actually exist in your data. It works through a three-stage process: first, it reads through your content and generates candidate labels for the themes it finds. For example, if your data is customer feedback about a mobile app, the agent might identify candidate labels like &#8220;battery drain,&#8221; &#8220;login problems,&#8221; &#8220;slow loading,&#8221; &#8220;confusing navigation,&#8221; &#8220;crashes on startup,&#8221; &#8220;great design,&#8221; &#8220;helpful support,&#8221; and dozens of others as it processes each piece of feedback.</p>
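<p>As a toy illustration of this first stage, one can imagine prompting an LLM for candidate labels one feedback item at a time. The prompt wording, the <code>complete</code> callable, and the <code>candidate_labels</code> helper below are illustrative assumptions, not part of any actual product API.</p>

```python
# Hypothetical stage-one sketch: ask an LLM for short candidate labels
# for a single piece of feedback. `complete` is any callable that sends
# a prompt to an LLM and returns its text reply (an assumption here).

def candidate_label_prompt(feedback: str, max_labels: int = 5) -> str:
    return (
        f"Read the customer feedback below and list up to {max_labels} "
        "short topic labels (2-4 words each), one per line. "
        "Use only themes actually present in the text.\n\n"
        f"Feedback: {feedback}"
    )

def candidate_labels(feedback: str, complete, max_labels: int = 5):
    reply = complete(candidate_label_prompt(feedback, max_labels))
    # Tolerate bullet characters the model may prepend to each line.
    return [line.strip("-* ").strip() for line in reply.splitlines() if line.strip()]
```

<p>Running this over every record yields the raw pool of candidate labels that the next two stages deduplicate.</p>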

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="1024" height="683" src="https://www.megaputer.com/wp-content/uploads/category-building-agent-1-1024x683.png" class="attachment-large size-large" alt="Category Building Agent" srcset="https://www.megaputer.com/wp-content/uploads/category-building-agent-1-1024x683.png 1024w, https://www.megaputer.com/wp-content/uploads/category-building-agent-1-300x200.png 300w, https://www.megaputer.com/wp-content/uploads/category-building-agent-1-600x400.png 600w, https://www.megaputer.com/wp-content/uploads/category-building-agent-1-768x512.png 768w, https://www.megaputer.com/wp-content/uploads/category-building-agent-1.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;"><em>The Category Builder is a three-stage pipeline</em></p>
<p>&nbsp;</p>
<p>In the second stage, the agent groups these candidate labels through semantic clustering. It recognizes when different labels are referring to the same concept. For example, if customers mentioned &#8220;battery life,&#8221; &#8220;battery drain,&#8221; and &#8220;poor battery performance,&#8221; these would be clustered together because they&#8217;re all talking about the same issue using different words. Similarly, &#8220;log in issues,&#8221; &#8220;login problems,&#8221; and &#8220;can&#8217;t sign in&#8221; would group into one cluster, while &#8220;crashes,&#8221; &#8220;app crashes,&#8221; and &#8220;freezing&#8221; might form another cluster.</p>
<p>Finally, in the third stage, the agent selects the best representative label for each cluster. From the battery-related cluster, it might choose &#8220;Battery Issues&#8221; as the clearest label. From the login cluster, it might select &#8220;Login Problems&#8221; as the most straightforward term.</p>
<p>This process eliminates redundancy and creates a clean taxonomy where each category represents a distinct concept, even when customers express the same issue in many different ways.</p>
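<p>The clustering and label-selection stages can be sketched in a few lines of Python. This is a deliberately simplified stand-in: a production Category Builder would cluster by semantic similarity (for example, embeddings or an LLM), whereas this toy uses plain string similarity, so it groups only labels that share wording.</p>

```python
# Toy sketch of stages two and three: group candidate labels that refer
# to the same concept, then pick a representative label per group.
# String similarity (difflib) stands in for semantic clustering so the
# example stays self-contained.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_labels(labels, threshold=0.45):
    clusters = []                        # each cluster: a list of labels
    for label in labels:
        for cluster in clusters:
            if similar(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters

def representative(cluster):
    return min(cluster, key=len)         # heuristic: shortest label is clearest

candidates = ["battery drain", "battery life", "poor battery performance",
              "login problems", "log in issues", "slow loading"]
taxonomy = [representative(c) for c in cluster_labels(candidates)]
```

<p>With real semantic clustering, paraphrases that share no characters (&#8220;can&#8217;t sign in&#8221; vs. &#8220;login problems&#8221;) would land in the same cluster as well.</p>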
<p>The <strong>Document Tagging Agent</strong> (aka Document Tagger) takes any new piece of content and maps it to the appropriate categories from the taxonomy the Category Builder created. Think of it as the agent that actually does the organizing work once the categories exist.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="1024" height="683" src="https://www.megaputer.com/wp-content/uploads/document-tagging-agent-1024x683.png" class="attachment-large size-large" alt="Document Tagging Agent" srcset="https://www.megaputer.com/wp-content/uploads/document-tagging-agent-1024x683.png 1024w, https://www.megaputer.com/wp-content/uploads/document-tagging-agent-300x200.png 300w, https://www.megaputer.com/wp-content/uploads/document-tagging-agent-600x400.png 600w, https://www.megaputer.com/wp-content/uploads/document-tagging-agent-768x512.png 768w, https://www.megaputer.com/wp-content/uploads/document-tagging-agent.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;"><em>The Document Tagger maps records to the categories discovered by the Category Builder</em></p>
<p>&nbsp;</p>
<p>When a new piece of feedback comes in—let&#8217;s say a customer writes &#8220;This app is killing my phone&#8217;s battery and takes forever to load anything&#8221;—the Document Tagger reads this content and identifies which categories apply. In this case, it would likely tag this feedback with both &#8220;Battery Issues&#8221; and whatever category the system created for loading/performance problems.</p>
<p>The Document Tagger doesn&#8217;t just look for exact keyword matches. It understands the meaning and context of the content. So if someone writes &#8220;My phone dies so fast when I use this app,&#8221; the agent recognizes this as a battery issue even though they didn&#8217;t use the word &#8220;battery.&#8221; Similarly, &#8220;The app is super sluggish&#8221; would get tagged appropriately even though &#8220;sluggish&#8221; wasn&#8217;t in the original candidate labels.</p>
<p>This agent can process content as it arrives, instantly categorizing new feedback, support tickets, or survey responses using the taxonomy that emerged from your initial dataset. It can also handle cases where a single piece of content belongs in multiple categories—like our example above that touched on both battery and performance issues.</p>
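<p>The tagger&#8217;s input/output shape can be illustrated with a toy: one document in, zero or more category tags out. The hand-written cue sets below merely stand in for the semantic matching a real Document Tagger performs with an LLM or embeddings.</p>

```python
# Toy multi-label tagger. A real Document Tagger matches on meaning;
# this stand-in uses hand-written cue words purely to show the shape
# of the task: one document in, a list of category tags out.

SYNONYMS = {
    "Battery Issues": {"battery", "drain", "drains", "dies", "charge"},
    "Slow Loading": {"slow", "sluggish", "loading", "forever"},
    "App Crashes": {"crash", "crashes", "freeze", "froze"},
}

def tag(document: str) -> list:
    words = set(document.lower().split())
    return [cat for cat, cues in SYNONYMS.items() if words & cues]

feedback = "This app is killing my phone's battery and takes forever to load anything"
print(tag(feedback))  # ['Battery Issues', 'Slow Loading']
```

<p>Note that the example above comes back with two tags, mirroring the multi-category case discussed in the text.</p>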
<h2>Building a Synergy: Agentic Collaboration</h2>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="600" height="400" src="https://www.megaputer.com/wp-content/uploads/taxonomy-600x400.png" class="attachment-shop_single size-shop_single" alt="Three-level taxonomy fragment" srcset="https://www.megaputer.com/wp-content/uploads/taxonomy-600x400.png 600w, https://www.megaputer.com/wp-content/uploads/taxonomy-300x200.png 300w, https://www.megaputer.com/wp-content/uploads/taxonomy-1024x683.png 1024w, https://www.megaputer.com/wp-content/uploads/taxonomy-768x512.png 768w, https://www.megaputer.com/wp-content/uploads/taxonomy.png 1536w" sizes="(max-width: 600px) 100vw, 600px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;"><em>A snapshot of a three-level taxonomy</em></p>
<p>&nbsp;</p>
<p>The Category Builder and the Document Tagger work together in a closed loop to build multi-level taxonomies that capture both broad themes and specific details.</p>
<p>Let’s continue with our mobile app example. During the initial three-stage process, the Category Builder might create a top-level category like “Performance Issues.” But as the Document Tagger processes more content, it notices that “Performance Issues” is getting tagged very frequently and contains distinct sub-themes.</p>
<p>When the Category Builder runs another round of analysis, it discovers that Performance Issues actually break down into: “Battery Drain,” “Slow Loading,” and “App Crashes.” The Category Builder creates this second level of hierarchy, with “Performance Issues” as the parent category and these three as subcategories.</p>
<p>Now the Document Tagger can work with this richer structure. When new feedback comes in saying “The app froze three times today and completely drained my battery,” the agent can tag it at multiple levels: the broad “Performance Issues” category, and the specific “App Crashes” and “Battery Drain” subcategories.</p>
<p>This collaborative process can continue recursively. If “App Crashes” becomes heavily populated, the agents might discover it breaks down further into “Crashes on Startup,” “Crashes During Use,” and “Crashes When Closing.” The taxonomy grows more sophisticated as it encounters more data, always staying grounded in what’s actually present in your content.</p>
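<p>One way to picture the closed loop is a refinement pass that splits any category once it accumulates enough tagged documents. Everything here is a sketch: <code>build_categories</code> stands in for the Category Builder, <code>documents_in</code> for the Tagger&#8217;s accumulated assignments, and the threshold is an arbitrary example value.</p>

```python
# Sketch of the agents' closed loop: a category that has accumulated
# many documents and has no subcategories yet is re-analyzed, rerunning
# the Category Builder on just that category's documents.

SPLIT_THRESHOLD = 200  # example value; a real system would tune this

def refine(taxonomy: dict, documents_in, build_categories) -> dict:
    """taxonomy maps each category name to a dict of its subcategories."""
    for category in list(taxonomy):
        docs = documents_in(category)
        if len(docs) >= SPLIT_THRESHOLD and not taxonomy[category]:
            taxonomy[category] = {sub: {} for sub in build_categories(docs)}
        refine(taxonomy[category], documents_in, build_categories)  # recurse
    return taxonomy
```

<p>Because the pass recurses into every level, a heavily populated subcategory like &#8220;App Crashes&#8221; would itself be split on a later pass, as described above.</p>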
<h2>Use Cases: Beyond Product Launch Monitoring</h2>
<p>While product launch feedback is a compelling example, Just-in-Time taxonomies (JIT taxonomies for short) solve organizational challenges across many different types of unstructured text data.</p>
<p><strong>Survey Analysis</strong> represents one of the most powerful applications. Open-ended survey responses often contain the richest insights, but they&#8217;re also the hardest to analyze at scale. Traditional approaches require manually reading through hundreds or thousands of responses to identify themes, then coding each response by hand. With Just-in-Time taxonomies, the agents automatically discover the themes present in your responses and categorize them accordingly. More importantly, you get to see the entire response distribution—not just summary statistics, but how responses cluster around different themes and what the full spectrum of sentiment looks like within each category.</p>
<p><strong>Social Listening at Scale</strong> becomes feasible in ways that weren&#8217;t possible before. Because the taxonomies are auto-generated, you can monitor thousands of influencers, social media accounts, and online conversations simultaneously. Traditional social listening requires pre-defining what topics and keywords to track, which means you miss unexpected conversations and emerging themes. With Just-in-Time taxonomies, you can cast a wide net across your entire industry ecosystem and let the system discover what people are actually talking about, including topics you hadn&#8217;t thought to monitor.</p>
<p><strong>Voice of Customer (VOC) programs</strong> generate massive amounts of unstructured feedback from multiple touchpoints: social media mentions, review sites, email feedback, focus groups, and customer interviews. Each of these can get its own taxonomy; alternatively, a single monolithic taxonomy can be built that unifies all of these data sources under one organizational structure.</p>
<p><strong>Customer support conversation analysis</strong> is another natural fit. Support tickets, chat logs, and call transcripts contain valuable information about recurring issues, but this knowledge often stays trapped in individual conversations. By automatically categorizing support interactions, teams can quickly identify systemic problems, track resolution patterns, and understand which issues are driving the most contact volume.</p>
<p>The common thread across all these applications is the same fundamental challenge: you have large volumes of text that need to be tagged, structured, and made searchable so that patterns become visible and actionable insights can emerge. Whether it&#8217;s understanding why customers are calling support, discovering unexpected themes in survey data, or tracking sentiment across multiple feedback channels, Just-in-Time taxonomies transform unstructured text into organized, analyzable data.</p>
<h2>Enriching the Taxonomy: Finding the KPI Drivers</h2>
<p>Creating the taxonomy can be just the first step in a larger business solution. The next step is enriching the taxonomy by calculating &#8220;importance scores&#8221; that reveal which categories actually matter for your business outcomes. We do this by using the taxonomic categories as predictors of a key performance metric—like customer satisfaction scores in survey data, or support quality ratings in customer service conversations. The strongest predictors are the most important drivers of your KPI. If these performance indicators are not initially present in the dataset, we may derive them using AI.</p>
<p>When the categories are sorted by importance scores, the most important ones are visually obvious. It is often a good strategy to work down the list, addressing the issues the AI identified in order of importance. This simple strategy can have profound effects on KPIs.</p>
<p>In one case, a software company wanted to improve its customer support, but had thousands of support conversations with no quality ratings. We used AI to evaluate each conversation and generate a &#8220;support quality&#8221; score.</p>
<p>Once we had the AI-derived quality score, we used the taxonomic categories as predictors to see which conversation topics were associated with poor support outcomes. The results were surprising. The company expected complex technical issues to drive poor support quality, but &#8220;password resets&#8221; appeared as the strongest predictor of low-quality interactions.</p>
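<p>A minimal version of this scoring can be sketched without any modeling library: treat each category tag as a binary predictor and measure the gap in mean KPI between tagged and untagged records. A production system would use a proper predictor-importance ranking; this gap statistic is a simple stand-in for illustration only.</p>

```python
# Rank taxonomy categories by a toy "importance score": the absolute
# gap in mean KPI (e.g. CSAT or support quality) between records
# tagged with the category and all other records.

def importance_scores(records):
    """records: list of (tags, kpi) pairs, e.g. ({"Password Resets"}, 2.0)."""
    categories = {tag for tags, _ in records for tag in tags}
    scores = {}
    for cat in categories:
        tagged = [kpi for tags, kpi in records if cat in tags]
        rest = [kpi for tags, kpi in records if cat not in tags]
        if tagged and rest:
            scores[cat] = abs(sum(tagged) / len(tagged) - sum(rest) / len(rest))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

<p>Sorting the result descending puts the strongest KPI drivers, such as the password-reset category in the case study, at the top of the list.</p>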

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-image align_center"><div class="w-image-h"><img width="600" height="324" src="https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-600x324.jpg" class="attachment-shop_single size-shop_single" alt="Key KPI Drivers" srcset="https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-600x324.jpg 600w, https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-300x162.jpg 300w, https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-1024x553.jpg 1024w, https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-768x415.jpg 768w, https://www.megaputer.com/wp-content/uploads/key-kpi-drivers-741x400.jpg 741w, https://www.megaputer.com/wp-content/uploads/key-kpi-drivers.jpg 1278w" sizes="(max-width: 600px) 100vw, 600px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p style="text-align: center;"><em>Top customer satisfaction score (CSAT) drivers ranked by importance</em></p>
<p>&nbsp;</p>
<p>Drilling down revealed that password reset conversations consistently involved long hold times, multiple transfers between agents, and frustrated customers. The company discovered that its password reset system was broken in a way that made simple requests unnecessarily complicated, but because these seemed like &#8220;easy&#8221; tickets, management had never paid attention to them.</p>
<p>They fixed the underlying system issues and streamlined the password reset process. As a result of fixing this <em>single issue</em>, overall support quality scores improved by 20%, and the volume of password-related tickets dropped by 60%. After all issues had been addressed, support quality improved by 40%.</p>
<p>Without the AI-derived quality metric and taxonomic analysis, they would have continued focusing on complex technical issues while a simple broken system was quietly frustrating thousands of customers with what should have been routine requests.</p>
<h2>When do you need a JIT taxonomy?</h2>
<p>Just-in-Time (JIT) taxonomies fill a gap that no human team can: if you need data-driven insights faster than manual taxonomy-building can deliver them, a JIT taxonomy is the only option. That makes them perfect for time-sensitive analysis, where waiting weeks for manual categorization means missing the opportunity entirely. But the applications extend far beyond just speed.</p>
<p>JIT taxonomies really shine when you&#8217;re exploring uncharted territory—entering new markets, launching innovative products, or researching unfamiliar customer segments where you genuinely don&#8217;t know what themes will emerge. Traditional taxonomies require you to define categories upfront, but sometimes the most valuable insights come from patterns you never would have anticipated. This makes them invaluable for competitive intelligence, emerging trend detection, or any situation where you&#8217;re trying to understand &#8220;what&#8217;s really going on&#8221; rather than confirming what you already suspect.</p>
<p>They&#8217;re also ideal for evolving content streams where the conversation itself is constantly changing. Social media discussions, ongoing customer feedback, live chat transcripts, and user-generated content don&#8217;t stay static—new themes emerge, language evolves, and yesterday&#8217;s taxonomy might miss today&#8217;s most important topics. A JIT taxonomy adapts as these changes happen, making it perfect for brand monitoring, community management, and any application where staying current matters more than maintaining rigid consistency.</p>
<p>The approach works particularly well with moderate-scale datasets—typically thousands to tens of thousands of documents. This covers most business scenarios perfectly: quarterly feedback analysis, event-driven research, pilot program evaluation, or departmental data organization. It&#8217;s large enough that manual analysis becomes impractical, but focused enough that AI can identify genuinely meaningful patterns rather than getting lost in noise.</p>
<p>JIT taxonomies are also excellent for proof-of-concept work. Before committing to a major taxonomy project, they can quickly demonstrate what&#8217;s possible with your specific data. This makes them invaluable for pilot programs, budget justification, or simply exploring whether text analysis will provide value for your particular use case. And when you&#8217;re resource-constrained—whether that means lacking dedicated research staff, facing tight deadlines, or working within a limited budget—they can deliver insights that would otherwise require weeks of manual work or expensive consulting engagements.</p>
<p>JIT taxonomies are less suitable for massive enterprise taxonomies requiring thousands of predefined categories, situations demanding strict compliance with industry standards or regulatory frameworks, or long-term strategic classification systems where consistency across years matters more than speed. If you&#8217;re integrating with established enterprise systems or building something that needs to conform to specific legal or industry requirements, traditional approaches may be more appropriate.</p>
<p>But rather than viewing this as an either-or choice, many organizations find the most value in hybrid approaches where JIT taxonomies extend and enhance traditional human-built structures. Consider a large enterprise that has spent years developing a comprehensive taxonomy for customer feedback—they have well-established top-level categories like &#8220;Product Quality,&#8221; &#8220;Service Experience,&#8221; and &#8220;Billing Issues.&#8221; These broad categories work well and align with their reporting structures and business processes.</p>
<p>However, at the detailed level, their traditional taxonomy might categorize everything under &#8220;Product Quality&#8221; without capturing the specific nuances that emerge in real customer feedback. This is where JIT taxonomies excel. By running the system specifically on feedback already categorized as &#8220;Product Quality,&#8221; it can automatically discover granular subcategories that reflect what customers are actually saying: &#8220;Packaging Damage,&#8221; &#8220;Assembly Instructions,&#8221; &#8220;Color Accuracy,&#8221; &#8220;Durability After 6 Months,&#8221; and dozens of other specific issues that human taxonomists might not have anticipated or prioritized.</p>
<p>These AI-generated extensions can serve end users directly, giving them immediate access to detailed categorization while the broader enterprise taxonomy remains stable and compliant. Alternatively, the JIT results can guide human taxonomists, showing them exactly where their existing categories are insufficient and what new subcategories would capture the most meaningful distinctions in the data. Instead of guessing which detailed categories to add, they can see empirically what themes are emerging and how frequently they appear.</p>
<p>This approach gives organizations the best of both worlds: the governance and consistency of traditional taxonomies at the strategic level, combined with the responsiveness and data-driven precision of JIT taxonomies where detailed categorization matters most.</p>
<h2>Conclusion: The Future of Data Organization</h2>
<p>Just-in-Time taxonomies represent a fundamental shift in how we approach unstructured data. Instead of forcing your data into predetermined boxes or waiting weeks for manual analysis, you can let the data reveal its own structure and start extracting insights immediately.</p>
<p>The technology works because it mirrors how human experts actually think about categorization—identifying patterns, grouping similar concepts, and refining understanding as more information becomes available. But it does this at machine speed and scale, processing thousands of documents in the time it would take a human to read dozens.</p>
<p>Whether you&#8217;re launching a product, analyzing survey responses, monitoring social conversations, or trying to understand customer support patterns, the fundamental challenge remains the same: transforming unstructured text into actionable intelligence. Just-in-Time taxonomies make this transformation immediate rather than eventual, comprehensive rather than sampled, and adaptive rather than static.</p>
<p>The question isn&#8217;t whether AI will change how we organize and analyze text data—it already has. The question is whether you&#8217;ll take advantage of these capabilities to understand your data while the insights still matter, or continue waiting for perfect taxonomies that arrive too late to drive decisions.</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><em>Would you like to see how JIT taxonomy works? Interested in trying it on a sample of your data?</em> <a href="https://www.megaputer.com/contact/">Let&#8217;s talk!</a></p>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/what-is-just-in-time-taxonomy/">What is Just-in-Time Taxonomy?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Data Curation Made Easy: Hybrid AI Approach</title>
		<link>https://www.megaputer.com/data-curation-made-easy-hybrid-ai-approach/</link>
		<pubDate>Tue, 03 Jun 2025 20:19:47 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35317</guid>
		<description><![CDATA[<p>Across industries, businesses constantly struggle to extract valuable insights from complex documents such as financial reports or research papers. These documents often contain a mix of dense text, tables packed with data, and intricate charts. Manually pulling insights from such material is not only tedious but also slows down decision-making, introduces risk, and can limit...</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/data-curation-made-easy-hybrid-ai-approach/">Data Curation Made Easy: Hybrid AI Approach</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Across industries, businesses constantly struggle to extract valuable insights from complex documents such as financial reports or research papers. These documents often contain a mix of dense text, tables packed with data, and intricate charts. Manually pulling insights from such material is not only tedious but also slows down decision-making, introduces risk, and can limit a company&#8217;s ability to stay ahead in fast-paced markets.</p>
<p>Now, imagine a solution that automates this process. Unstructured and semi-structured documents can be instantly transformed into actionable data, and your team can easily change the focus of analysis. This is the power of next-generation data curation, which is fundamentally changing how businesses approach information analysis and reporting.</p>
<p>Consider the pharmaceutical industry as an example. Picture a data analyst at a leading pharma company responsible for reviewing dozens of research papers describing clinical trial results each month. These reports are filled with highly technical text, detailed tables, and a variety of charts. In the past, the analyst might have spent weeks extracting, organizing, and interpreting this information before presenting it to decision-makers. This manual approach is slow and increases the risk of overlooking critical information.</p>
<p>We developed a Hybrid AI solution that addresses these challenges. By integrating the robust analytical platform PolyAnalyst with advanced Generative AI models, such as ChatGPT and Claude, we have created a solution that automates and accelerates the entire data curation process. Organizations in finance, pharma, life sciences, academia, and other sectors can now analyze complex documents with ease. Vast volumes of unstructured information get converted into actionable intelligence in a fraction of the usual time, requiring little human intervention.</p>
<p><img class="alignnone wp-image-35318 aligncenter" src="https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-240x300.png" alt="" width="429" height="536" srcset="https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-240x300.png 240w, https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-820x1024.png 820w, https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-768x959.png 768w, https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-320x400.png 320w, https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics-600x749.png 600w, https://www.megaputer.com/wp-content/uploads/fig1-data-curation-infographics.png 1230w" sizes="(max-width: 429px) 100vw, 429px" /></p>
<p style="text-align: center;">Figure 1. AI Data Curation solution overview</p>
<h2>How It Works</h2>
<p>The AI Data Curation solution uses a streamlined workflow organized into three main stages:</p>
<h4><strong>Stage 1: Converting documents to machine-readable form</strong></h4>
<ol>
<li><strong>Object separation:<br />
</strong>When the user uploads documents to the platform, PolyAnalyst initiates the analysis workflow. Using a Vision-Language Model (VLM), the system detects and separates plain text, tables, and charts on each page.<img class=" wp-image-35319 aligncenter" src="https://www.megaputer.com/wp-content/uploads/fig2-object-detection-300x244.png" alt="" width="430" height="350" srcset="https://www.megaputer.com/wp-content/uploads/fig2-object-detection-300x244.png 300w, https://www.megaputer.com/wp-content/uploads/fig2-object-detection.png 324w" sizes="(max-width: 430px) 100vw, 430px" />
<p style="text-align: center;"><span style="font-size: 16px;">Figure 2. Detecting objects on a page</span></p>
</li>
<li><strong>OCR:<br />
</strong>Neural network-based OCR engine converts textual and tabular content into machine-readable form.</li>
<li><strong>Figure Interpretation:<br />
</strong>A VLM interprets graphical figures, such as line charts or bar charts, and converts them into structured data tables, retaining their positions in the original document.</li>
</ol>
<p><img class="alignnone wp-image-35320 aligncenter" src="https://www.megaputer.com/wp-content/uploads/fig3-chart-to-table-300x113.png" alt="" width="640" height="241" srcset="https://www.megaputer.com/wp-content/uploads/fig3-chart-to-table-300x113.png 300w, https://www.megaputer.com/wp-content/uploads/fig3-chart-to-table.png 553w" sizes="(max-width: 640px) 100vw, 640px" /></p>
<p style="text-align: center;">Figure 3. Converting chart to table</p>
<h4><strong>Stage 2: AI-driven Analysis</strong></h4>
<ol start="4">
<li><strong>Stating Facts of Interest:</strong><br />
The user describes target facts of interest through a simple interface.</li>
<li><strong>Extracting Facts of Interest:</strong><br />
An LLM finds and extracts facts specified by the user.</li>
</ol>
<h4><strong>Stage 3: Results Integration and Presentation</strong></h4>
<ol start="6">
<li><strong>Data Integration and Post-processing:<br />
</strong>Facts extracted by an LLM are cleansed and organized into unified datasets using sophisticated post-processing algorithms available in PolyAnalyst. This provides coherent insights for the entire document.</li>
<li><strong>Indexing and highlighting:<br />
</strong>The solution automatically creates linguistic rules that help navigate the original documents and highlight patterns supporting the performed fact extraction.<br />
<img class=" wp-image-35321 aligncenter" src="https://www.megaputer.com/wp-content/uploads/fig4-create-linguistic-rules-300x130.png" alt="" width="505" height="219" srcset="https://www.megaputer.com/wp-content/uploads/fig4-create-linguistic-rules-300x130.png 300w, https://www.megaputer.com/wp-content/uploads/fig4-create-linguistic-rules.png 493w" sizes="(max-width: 505px) 100vw, 505px" />
<p style="text-align: center;">Figure 4. Linguistic rules for pattern highlighting</p></li>
<li><strong>Interactive Web Reports:<br />
</strong>Users can access graphical reports, view the extraction results alongside the original documents with highlights, and validate results of extraction using an interactive interface. This enhances transparency and trust.</li>
</ol>
<p><img class="alignnone wp-image-35322 aligncenter" src="https://www.megaputer.com/wp-content/uploads/fig5-report-300x118.png" alt="" width="646" height="254" srcset="https://www.megaputer.com/wp-content/uploads/fig5-report-300x118.png 300w, https://www.megaputer.com/wp-content/uploads/fig5-report.png 519w" sizes="(max-width: 646px) 100vw, 646px" /></p>
<p style="text-align: center;">Figure 5. Interface for navigating and validating the results of AI-based extraction</p>
<h2>Key Advantages of AI Data Curation</h2>
<ul>
<li><strong>Efficiency:</strong> The automated extraction process dramatically reduces time and manual effort.</li>
<li><strong>Precision:</strong> Robust data transformation and analysis software works alongside state-of-the-art Generative AI to provide highly accurate extraction and analysis.</li>
<li><strong>Scalability:</strong> The system can process numerous complex documents at once, making it suitable for both small projects and enterprise-level data initiatives.</li>
<li><strong>Simple Manual Validation:</strong> An interactive interface enables easy validation, which minimizes errors and ensures high-quality results.</li>
<li><strong>Minimal Human Intervention: </strong>Our multi-model AI approach allows different AI engines to collaborate seamlessly and share data and insights automatically. The user only specifies a collection of facts of interest.</li>
<li><strong>Broad Applicability: </strong>The user can simply specify a different set of facts of interest to repurpose the solution for a different task, which makes it easily adaptable to a wide range of industries and document types. Moreover, in the “free hunting” mode (to be described in a separate blog article soon), the system requires little to no domain-specific input from users: it can use the contents of the documents themselves to extract and organize facts of potential interest and then present them to the user.</li>
</ul>
<p>By blending the analytical capabilities of PolyAnalyst with advanced Generative AI, our AI Data Curation solution automates and simplifies the extraction of facts of interest from complex documents. Actionable insights are delivered faster and with higher accuracy.</p>
<p>To see the AI Data Curation solution in action, <a href="https://www.megaputer.com/contact/">request a demo</a>.</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/data-curation-made-easy-hybrid-ai-approach/">Data Curation Made Easy: Hybrid AI Approach</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Retrieval Augmented Generation (RAG) in simple business terms</title>
		<link>https://www.megaputer.com/rag-in-simple-business-terms/</link>
		<pubDate>Tue, 03 Jun 2025 15:00:34 +0000</pubDate>
		<dc:creator><![CDATA[Sergei Ananyan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35314</guid>
		<description><![CDATA[<p>1. What Is RAG? Imagine pairing a super‑fast semantic search engine with a large language model (LLM). The search engine grabs the most relevant facts found in the loaded corpus of documents; the LLM turns those facts into clear answers. That combo is Retrieval‑Augmented Generation (RAG). 2. How Does It Work? Retrieve – The system...</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/rag-in-simple-business-terms/">Retrieval Augmented Generation (RAG) in simple business terms</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p id="ember527" class="ember-view reader-text-block__paragraph"><strong>1. What Is RAG?</strong></p>
<p id="ember528" class="ember-view reader-text-block__paragraph">Imagine pairing a super‑fast semantic search engine with a large language model (LLM). The search engine grabs the most relevant facts found in the loaded corpus of documents; the LLM turns those facts into clear answers. That combo is <strong>Retrieval‑Augmented Generation (RAG)</strong>.</p>
<p id="ember529" class="ember-view reader-text-block__paragraph"><strong>2. How Does It Work?</strong></p>
<ol>
<li><strong>Retrieve</strong> – The system quickly searches your chosen knowledge sources (public sites, private docs, etc.) and pulls out the fragments most related to the question.</li>
<li><strong>Augment</strong> – The matching fragments are presented to the LLM in a prompt so the model “sees” them.</li>
<li><strong>Generate</strong> – The LLM writes an answer grounded strictly in these fragments and provides references to the relevant passages in the original documents.</li>
</ol>
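The three steps above can be sketched end-to-end in a few lines. The toy retriever below ranks documents with bag-of-words cosine similarity purely for illustration; production RAG systems use neural embeddings and a vector store, and the prompt wording is an assumption for this example, not any particular vendor's.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words term-frequency vector.
    # Real RAG systems use neural sentence encoders instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=2):
    # Step 1: Retrieve -- rank fragments by similarity to the question.
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, fragments):
    # Step 2: Augment -- place the retrieved fragments into the prompt.
    context = "\n".join(f"[{i + 1}] {f}" for i, f in enumerate(fragments))
    return ("Answer using ONLY the numbered sources below; cite them as [n].\n"
            f"{context}\nQuestion: {question}")

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The warranty covers manufacturing defects for two years.",
    "Shipping is free for orders over $50.",
]
question = "What is the refund policy for returns?"
fragments = retrieve(question, docs)
prompt = build_prompt(question, fragments)  # Step 3: send this to an LLM
print(fragments[0])
```

Here the refund-policy fragment ranks first because it shares the most terms with the question; step 3, Generate, is simply a call to an LLM with the assembled prompt.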
<p id="ember531" class="ember-view reader-text-block__paragraph">Think of it as having a clever librarian (retriever) who hands the expert writer (LLM) exactly the pages they need to produce an informative answer to the user question.</p>
<p id="ember532" class="ember-view reader-text-block__paragraph"><strong>3. Why Your Business Cares</strong></p>
<ul>
<li><strong>Up‑to‑Date Answers</strong> – Since it reads live data, RAG isn’t stuck with an old knowledge cutoff.</li>
<li><strong>Uses Your Proprietary Info</strong> – It can securely tap your internal wikis, policies, or product docs.</li>
<li><strong>Fewer Hallucinations</strong> – Grounding the model in real text sharply reduces the number of made‑up answers.</li>
<li><strong>Transparent</strong> – The system can show where each fact came from.</li>
</ul>
<p id="ember534" class="ember-view reader-text-block__paragraph"><strong>4. Things to Keep in Mind</strong></p>
<ul>
<li id="ember535" class="ember-view reader-text-block__paragraph"><strong>Retrieval Quality. </strong>Accurate answers depend on accurate source documents. Keep your content clean and current.</li>
<li id="ember536" class="ember-view reader-text-block__paragraph"><strong>Speed vs. Depth.</strong> More sources or deeper checks add a bit of response time. We tune this to fit your use case.</li>
<li id="ember537" class="ember-view reader-text-block__paragraph"><strong>Security.</strong> Access controls ensure only authorized users and docs are included.</li>
<li id="ember538" class="ember-view reader-text-block__paragraph"><strong>Bias &#038; Accuracy.</strong> The system reflects the documents it reads. We monitor and filter for fairness and correctness.</li>
</ul>
<p id="ember539" class="ember-view reader-text-block__paragraph"><strong>5. Key Takeaways</strong></p>
<ul>
<li>RAG combines search + generative AI so customers and staff get accurate, cited answers in seconds.</li>
<li>It lets you unlock the value of the information you already have without constant model retraining.</li>
<li>Done right, it is fast, secure, and dramatically reduces AI “hallucinations.”</li>
</ul>
<p id="ember541" class="ember-view reader-text-block__paragraph"><em>Want to see RAG in action on your data? </em><a class="FYFwKXYNhRNjjSvOyrNhgyvQhtApFmdaFNvUxIg " tabindex="0" href="https://www.megaputer.com/contact/" target="_self" data-test-app-aware-link=""><em>Contact our team for a demo</em></a><em>.</em></p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/rag-in-simple-business-terms/">Retrieval Augmented Generation (RAG) in simple business terms</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Query languages—the Swiss army knife of information extraction</title>
		<link>https://www.megaputer.com/query-languages-the-swiss-army-knife-of-information-extraction/</link>
		<pubDate>Tue, 06 Feb 2024 05:19:41 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Entity Extraction]]></category>
		<category><![CDATA[Fuzzy Matching]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Morphology]]></category>
		<category><![CDATA[Pattern Definition Language]]></category>
		<category><![CDATA[Text Analytics]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35231</guid>
		<description><![CDATA[<p>Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user requires a language for writing queries. For the most basic use cases, the language operators could be regex or string search. But while regex and string search are...</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/query-languages-the-swiss-army-knife-of-information-extraction/">Query languages—the Swiss army knife of information extraction</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p><span style="font-weight: 400;">Text mining, the art of extracting information from text, requires the formulation of efficient queries that retrieve information based on user input. To do this, the user needs a language for writing queries. For the most basic use cases, regex or plain string search may suffice. But while regex and string search are indispensable for text mining, their utility hits a hard ceiling when semantic, meaning-aware search is required. They cannot, for example, capture complex entities such as human names, corporations, or drugs. For handling tasks like these, we need a more powerful query language that has semantic understanding, such as Megaputer’s PDL.</span></p>
<p><span style="font-weight: 400;">So, what is PDL, and how does it achieve semantic understanding while regex and string search do not? Let’s take a look at an example to find out.</span></p>
<p><span style="font-weight: 400;">First of all, PDL does not just search for the literal form of the word in the query: instead, it automatically extends its search to all morphological forms of the word. For example, when searching for the word “company” in financial news articles, the PDL query will not only find “company”, but also “companies,” the plural form. This feature often comes in handy, especially when the search involves a verb. Suppose that you are interested in extracting </span><i><span style="font-weight: 400;">what the CEOs said. </span></i><span style="font-weight: 400;">With regex or other substring search, you will need to list all possible verb forms such as “say,” “saying,” “says,” and “said.” With PDL, simply entering “say” in the query will automatically fetch all possible verb forms. This behavior can also be turned off by enclosing the word in the </span><i><span style="font-weight: 400;">form </span></i><span style="font-weight: 400;">function</span><span style="font-weight: 400;">,</span><span style="font-weight: 400;"> which will then restrict the search to the literal form of the word, such as in the example below.</span></p>
<p><img class="wp-image-35249 aligncenter" src="https://www.megaputer.com/wp-content/uploads/comparison_pdl-1.png" alt="" width="800" height="497" /><br />
<!-- <img class="wp-image-35232 aligncenter" src="https://www.megaputer.com/wp-content/uploads/pdl-image-1-300x186.jpg" alt="" width="710" height="440" srcset="https://www.megaputer.com/wp-content/uploads/pdl-image-1-300x186.jpg 300w, https://www.megaputer.com/wp-content/uploads/pdl-image-1-1024x636.jpg 1024w, https://www.megaputer.com/wp-content/uploads/pdl-image-1-768x477.jpg 768w, https://www.megaputer.com/wp-content/uploads/pdl-image-1-644x400.jpg 644w, https://www.megaputer.com/wp-content/uploads/pdl-image-1-600x373.jpg 600w" sizes="(max-width: 710px) 100vw, 710px" /> --></p>
<p><span style="font-weight: 400;">Another notable feature of the PDL language is its capability for users to tailor the scope of their searches using a range of built-in functions. Returning to the previous example, you may not wish to confine your search exclusively to the specific verb “say,” but rather include other synonymous verbs like “tell” or “mention.” Achieving this is straightforward with PDL – users can invoke the </span><i><span style="font-weight: 400;">synonym</span></i> <span style="font-weight: 400;">function with the verb &#8220;say,&#8221; as demonstrated in (a) below. As the subsequent results table (b) illustrates, the captured text now includes various speech verbs such as “tell,” “emphasize,” and “claim,” in addition to the word “say,” capturing them in all possible verb forms. For additional flexibility, the user can also create and modify synonym dictionaries.</span></p>
<p><img class="wp-image-35250 aligncenter" src="https://www.megaputer.com/wp-content/uploads/comparison_pdl-2.png" alt="" width="800" height="497" /><br />
<!-- <img class="wp-image-35235 aligncenter" src="https://www.megaputer.com/wp-content/uploads/pdl-image-2-300x269.jpg" alt="" width="737" height="661" srcset="https://www.megaputer.com/wp-content/uploads/pdl-image-2-300x269.jpg 300w, https://www.megaputer.com/wp-content/uploads/pdl-image-2-1024x919.jpg 1024w, https://www.megaputer.com/wp-content/uploads/pdl-image-2-768x689.jpg 768w, https://www.megaputer.com/wp-content/uploads/pdl-image-2-446x400.jpg 446w, https://www.megaputer.com/wp-content/uploads/pdl-image-2-600x538.jpg 600w" sizes="(max-width: 737px) 100vw, 737px" /> --></p>
<p><span style="font-weight: 400;">The PDL language offers various modes of information extraction, including proximity search (e.g., finding words A and B within a sentence, or within a 3-words range), syntactic relation (e.g., finding word A that is the subject or object of B), semantic relation (e.g., finding words that are synonyms/antonyms to word A), access to dictionaries and ontologies, and more. This language is expressive enough to capture complex patterns, and yet relatively easy to use, </span><span style="font-weight: 400;">having a syntax that closely resembles English. Having access to this versatile query language significantly enhances the power and quality of text mining operations.</span></p>
<p><span style="font-weight: 400;">In conclusion, PDL is a powerful and versatile query language that enables users to extract meaningful information from text with greater efficiency and accuracy than competing methods like regex or string search. Its ability to understand and capture morphological forms, synonyms, and other complex patterns makes it an indispensable tool for solving text mining tasks that require semantic understanding. By leveraging the capabilities of PDL, users can enhance their information extraction processes and gain valuable insights from their data, making it a true Swiss army knife of information extraction.</span></p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/query-languages-the-swiss-army-knife-of-information-extraction/">Query languages—the Swiss army knife of information extraction</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>What is a Linguistic Corpus?</title>
		<link>https://www.megaputer.com/what-is-linguistic-corpus/</link>
		<pubDate>Tue, 05 Dec 2023 20:55:10 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=35025</guid>
		<description><![CDATA[<p>The post <a rel="nofollow" href="https://www.megaputer.com/what-is-linguistic-corpus/">What is a Linguistic Corpus?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">At the core of Natural Language Processing lies extensive language data. The availability of copious amounts of text or speech, faithfully representing natural language, is pivotal for the efficacy of artificial language systems. Diverse forms of text, from Wikipedia articles to Twitter threads, find application in various language models. These substantial language datasets collectively constitute what is known as a corpus in linguistics—an extensive and principled collection of authentic, naturally occurring language.</span></p>

		</div>
	</div>
<div class="w-separator size_small"></div><div class="w-image align_center"><div class="w-image-h"><img width="564" height="374" src="https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27.jpg" class="attachment-full size-full" alt="Random collection of different words and word-forms on magnetic tiles" srcset="https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27.jpg 564w, https://www.megaputer.com/wp-content/uploads/photo_2023-11-20_12-03-27-300x199.jpg 300w" sizes="(max-width: 564px) 100vw, 564px" /></div></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Despite gaining increased attention with the advent of LLMs, the concept of a corpus is not new in linguistics, dating back to the early 20th century or earlier. During that time, lexicographers manually compiled substantial language samples for dictionary construction, representing a rudimentary form of a corpus. Beyond lexicographers and grammarians, linguists manually compiled corpora for various purposes, such as language pedagogy or the study of language acquisition in children. Moreover, the so-called &#8220;structural&#8221; linguists, who dominated the field until the 1950s, emphasized the utility of corpora, believing in the analytical approach to the distributional characteristics of language. With this idea, linguist Charles C. Fries, for example, manually gathered telephone conversations from approximately 300 speakers, encompassing 250,000 words. Field linguist Franz Boas dedicated himself to establishing a substantial corpus of texts from Native Americans.</span></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">In the 1960s, however, the first electronic corpus appeared. W. Nelson Francis and Henry Kučera, based at Brown University, crafted a corpus of 1 million American English words, famously known as the Brown corpus. Thereafter, linguistic corpora rapidly evolved and diversified in terms of size and scope, driven by advancements in computer hardware. For instance, </span><a href="https://www.english-corpora.org/coca/"><span style="font-weight: 400;">the Corpus of Contemporary American English (COCA)</span></a><span style="font-weight: 400;">, currently </span><span style="font-weight: 400;">one of the most widely used American English corpora, encompasses over one billion words sourced from eight different genres.</span></p>

		</div>
	</div>
<div class="w-separator size_medium"></div><div class="w-image align_center"><div class="w-image-h"><img width="600" height="451" src="https://www.megaputer.com/wp-content/uploads/istock-1439146187-600x451.jpg" class="attachment-shop_single size-shop_single" alt="Abstract source code background, Big data database app, Computer code, Lines of HTML code by programmer, active online data transformation and exchange" srcset="https://www.megaputer.com/wp-content/uploads/istock-1439146187-600x451.jpg 600w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-300x225.jpg 300w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-1024x769.jpg 1024w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-768x577.jpg 768w, https://www.megaputer.com/wp-content/uploads/istock-1439146187-533x400.jpg 533w" sizes="(max-width: 600px) 100vw, 600px" /></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><b>Types of Corpora and their use</b></h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Corpora can be broadly categorized into two types: generalized and specialized. A generalized corpus aims to represent all aspects of a language and typically consists of a large number of words and/or documents. Examples of generalized corpora include the </span><a href="http://www.natcorp.ox.ac.uk/"><span style="font-weight: 400;">British National Corpus</span></a><span style="font-weight: 400;"> and </span><a href="https://www.english-corpora.org/coca/"><span style="font-weight: 400;">COCA</span></a><span style="font-weight: 400;">, mentioned earlier. On the other hand, a specialized corpus is constructed to represent specific aspects of a language, such as language produced by particular user groups (e.g., children, teenagers, second language learners, aphasics) or in specific settings (e.g., university, business). The </span><a href="https://childes.talkbank.org/access/"><span style="font-weight: 400;">Child Language Data Exchange System (CHILDES)</span></a><span style="font-weight: 400;">, a large corpus of language produced by developing children, and the </span><a href="https://scnweb.japanknowledge.com/register/PERC/index.html"><span style="font-weight: 400;">Professional English Research Consortium (PERC)</span></a><span style="font-weight: 400;"> corpus, which contains English academic journal texts in science, engineering, technology, and other fields, are examples of specialized corpora.</span></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><img style="float: right;" src="https://www.megaputer.com/wp-content/uploads/frame-2.png" alt="Parts of Speech" width="500" /><span style="font-weight: 400;">Corpora are frequently enriched with annotations containing linguistic information. Typically, texts within corpora undergo Part of Speech (POS) tagging, wherein the grammatical category of each word is identified. Beyond POS tagging, corpora can be further enhanced with annotations covering syntactic structure, semantic roles (denoting the roles played by each noun argument in a sentence, such as agent, patient, instrument, etc.), Named Entity Recognition (NER), sentiment analysis, and more.</span>To illustrate, the <a href="https://catalog.ldc.upenn.edu/LDC99T42"><span style="font-weight: 400;">Penn Treebank</span></a><span style="font-weight: 400;"> project annotates a collection of Wall Street Journal articles with both syntactic dependency information and POS tagging. Another type of syntactic corpus might be the </span><a href="http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html"><span style="font-weight: 400;">Google Books Syntactic N-grams dataset</span></a><span style="font-weight: 400;">, comprising syntactic tree fragments automatically extracted from the Google English Books collection. Some corpora are annotated with semantic information. The </span><a href="https://propbank.github.io/"><span style="font-weight: 400;">Proposition Bank</span></a><span style="font-weight: 400;">, for example, provides a collection of English sentences that are manually annotated with various semantic roles. Some corpora also feature Named Entity Recognition. For example, </span><a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S3"><span style="font-weight: 400;">GENETAG</span></a><span style="font-weight: 400;"> is a corpus of 20,000 sentences from the MEDLINE database annotated with gene/protein NER.</span></p>

		</div>
	</div>
</div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2><b>Use of Corpora in NLP</b></h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Corpora have traditionally served as tools to investigate specific aspects of language use, including collocation, frequency, preferred constructions, patterns of errors, and more. With the emergence of computer technology and machine learning algorithms, they have evolved into essential resources for model training and validation. Large Language Models are now trained on vast amounts of language data. For instance, OpenAI&#8217;s GPT-3 175B model is trained with 300 billion tokens gathered from sources such as Common Crawl, WebText2, and Wikipedia. Meta&#8217;s LLaMa 65B model, on the other hand, is trained with 1.4 trillion tokens sourced from various platforms, including English CommonCrawl, C4, Github, Wikipedia, ArXiv, and Stack Exchange. In our PolyAnalyst, diverse corpora like </span><a href="https://anc.org/"><span style="font-weight: 400;">Open American National Corpus</span></a><span style="font-weight: 400;">, the </span><a href="https://catalog.ldc.upenn.edu/LDC99T42"><span style="font-weight: 400;">Penn Treebank</span></a><span style="font-weight: 400;">, CoNLL Corpus, </span><a href="https://opus.nlpl.eu/Europarl-v3.php"><span style="font-weight: 400;">Europarl</span></a><span style="font-weight: 400;">, the </span><a href="https://ruscorpora.ru/en/"><span style="font-weight: 400;">Russian National Corpus</span></a><span style="font-weight: 400;">, and Wikipedia are employed for various modules. The utilization of large language datasets is poised to expand further in the future, making the adept handling of such data a crucial skill in the field of Natural Language Processing.</span></p>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/what-is-linguistic-corpus/">What is Linguistic Corpus?</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Medical Ontology and Its Use in Text Analysis</title>
		<link>https://www.megaputer.com/medical-ontology-and-its-use-in-text-analysis/</link>
		<pubDate>Tue, 24 Oct 2023 14:00:32 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Text Analytics]]></category>
		<category><![CDATA[PolyAnalyst]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=34772</guid>
		<description><![CDATA[<p>Medical ontologies are a treasure trove of information. They help distill complex knowledge domains into a hierarchical structure of facts.  In this article, we will walk through what exactly an ontology is and how we can use ontologies for analysis tasks in the medical domain.</p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/medical-ontology-and-its-use-in-text-analysis/">Medical Ontology and Its Use in Text Analysis</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="ult-spacer spacer-69d434ecce074" data-id="69d434ecce074" data-height="10" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div><div class="w-image align_none"><div class="w-image-h"><img width="300" height="300" src="https://www.megaputer.com/wp-content/uploads/healthcare-coding2.png" class="attachment-shop_catalog size-shop_catalog" alt="A picture of a filing cabinet where each side of the cabinet has drawers, some partially open" srcset="https://www.megaputer.com/wp-content/uploads/healthcare-coding2.png 500w, https://www.megaputer.com/wp-content/uploads/healthcare-coding2-300x300.png 300w, https://www.megaputer.com/wp-content/uploads/healthcare-coding2-150x150.png 150w, https://www.megaputer.com/wp-content/uploads/healthcare-coding2-350x350.png 350w" sizes="(max-width: 300px) 100vw, 300px" /></div></div></div></div></div><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="ult-spacer spacer-69d434eccf092" data-id="69d434eccf092" data-height="20" data-height-mobile="30" data-height-tab="20" data-height-tab-portrait="20" data-height-mobile-landscape="30" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">The complexity and variability of biomedical language pose a challenge in fields dealing with biomedical data. A vast array of terms, referring to various entities, including living organisms, biochemicals, drugs, biological pathways, diseases, and disease symptoms in patients, is utilized in biomedical research and closely associated healthcare industries. The intricacy inherent in this domain necessitates both a structured representation of knowledge and the use of a standardized language to describe these entities. Medical ontology serves as a valuable tool for fulfilling this crucial purpose.</span></p>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2>What is Ontology?</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Originally a philosophical study of existence, ontology, in the context of information science, refers to a formal representation of knowledge or information within a specific domain. Ontology provides a structured framework for organizing entities, their properties, and relationships. To illustrate, let&#8217;s consider the ontology of a country. As shown below, it encompasses entities such as states, cities, towns, and villages, along with their hierarchy and relationships. Interestingly, a similar ontology focused on geographical information serves as the foundation for the widely used Geographic Information System (GIS).</span></p>

		</div>
	</div>
<div class="w-image align_center"><div class="w-image-h"><img width="600" height="338" src="https://www.megaputer.com/wp-content/uploads/fig01_slide1-600x338.jpg" class="attachment-shop_single size-shop_single" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig01_slide1-600x338.jpg 600w, https://www.megaputer.com/wp-content/uploads/fig01_slide1-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig01_slide1-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig01_slide1-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig01_slide1.jpg 1280w" sizes="(max-width: 600px) 100vw, 600px" /></div></div><div class="ult-spacer spacer-69d434ecd0d23" data-id="69d434ecd0d23" data-height="85" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Like geographic data, medical data is highly complex—perhaps even more so—due to the vast diversity of biological entities and their interdependent relationships and interactions. This inherent complexity has prompted the development of numerous ontologies designed to maintain consistency and accuracy in medical data, records, and research. At present, there are hundreds of medical ontologies, and you can find many of them by following </span><a href="http://bioportal.bioontology.org/" target="_blank" rel="noopener"><span style="font-weight: 400;">this</span></a><span style="font-weight: 400;"> link. </span></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">There are numerous advantages to utilizing ontologies. For biomedical researchers, the use of medical ontologies can assist them in annotating and categorizing data with standardized terminology. This, in turn, leads to clear communication and makes it easier to identify relationships, patterns, and other elements within the data. For medical practitioners, ontologies facilitate the exchange of patient information among healthcare providers and entities, ensuring that data sent and received by different systems are interpreted consistently. This not only enhances the accuracy and consistency of medical information but also promotes interoperability between different healthcare systems. Furthermore, it supports the development of intelligent systems for medical diagnosis and treatment.</span></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><span style="font-weight: 400;">Apart from the benefits in research and practice, medical ontology also plays a crucial role in text mining. There is a constant influx of biomedical text from various sources every day. However, due to the intricacies of biomedical concepts and the inherent ambiguity in natural language, precise data extraction from large and heterogeneous data sources is not feasible without a system of standardization. Medical ontology helps mitigate this ambiguity and variability in text data. Furthermore, medical ontology is also pivotal in the Semantic Web, where web data are represented through structured ontological representations, allowing interpretation by both humans and machines. In short, the use of medical ontology leads to more intelligent search, enhanced data integration, and improved knowledge discovery across a wide range of medical and healthcare-related resources.</span></p>

		</div>
	</div>
<div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2>An overview of commonly used medical ontologies</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>There are a number of ontologies that are particularly helpful when building text analysis solutions for the medical field. In what follows, we provide an overview of some of the more commonly used ontologies.</p>

		</div>
	</div>
</div></div></div></div><div class="w-separator size_small"></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>MeSH</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The <a href="https://www.nlm.nih.gov/mesh/meshhome.html" target="_blank" rel="noopener">Medical Subject Headings</a> (MeSH) dictionary is a hierarchical collection of medical terms describing various conditions, diseases, and symptoms. It was created by the US National Library of Medicine (NLM) at the National Institutes of Health (NIH).</p>
<p>MeSH is utilized for indexing and searching biomedical information found in national medical databases, including MEDLINE/PubMed. Many biomedical articles are now indexed with MeSH terms, either manually or automatically by computer, enabling users to retrieve records on a specific subject with accuracy and efficiency.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_huge"></div><div class="w-image align_center"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/fig03_slide3.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig03_slide3.jpg 1280w, https://www.megaputer.com/wp-content/uploads/fig03_slide3-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig03_slide3-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig03_slide3-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig03_slide3-600x338.jpg 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div></div></div></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The MeSH tree structure comprises 16 main branches, such as Anatomy, Organisms, and Diseases. Each branch is further divided into hierarchical levels, ranging from more general to more specific categories. These levels encode taxonomic relationships between terms, such as the following:</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li aria-level="1"><b>synonyms</b>: terms with the same, or nearly the same, meaning (e.g., “dwelling” and “residence” are synonyms of “home”)</li>
<li aria-level="1"><b>hypernyms</b>: terms with a broader meaning that covers a group of more specific terms (e.g., “animal” is a hypernym of “dog” and “cat”)</li>
<li aria-level="1"><b>hyponyms</b>: terms with a more specific meaning than a general term (e.g., “apple”, “banana”, and “orange” are hyponyms of “fruit”)</li>
</ul>

		</div>
	</div>
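These three relation types can be sketched as plain lookup tables. The vocabulary below simply reuses the article's examples; it is not real MeSH data, and real ontologies store these relations with far more structure.

```python
# Toy lookup tables for the three relation types described above.
synonyms = {"home": {"dwelling", "residence"}}       # same meaning
hypernym = {"dog": "animal", "cat": "animal"}        # term -> broader term
hyponyms = {"fruit": {"apple", "banana", "orange"}}  # term -> narrower terms

def is_synonym(a, b):
    """Symmetric synonym check over the toy table."""
    return b in synonyms.get(a, set()) or a in synonyms.get(b, set())

print(is_synonym("home", "dwelling"))   # True
print(hypernym["dog"])                  # animal
print(sorted(hyponyms["fruit"]))        # ['apple', 'banana', 'orange']
```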

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>For example, Angina Pectoris, a medical term for chest pain due to coronary heart disease, would be mapped out as shown below:</p>

		</div>
	</div>
<div class="w-image align_center"><div class="w-image-h"><img width="1920" height="1080" src="https://www.megaputer.com/wp-content/uploads/mesh_ontology_1.gif" class="attachment-full size-full" alt="" /></div></div><div class="w-separator size_large"></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>ICD-10</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The <a href="https://icd10cmtool.cdc.gov/" target="_blank" rel="noopener">International Statistical Classification of Diseases and Related Health Problems</a> (ICD) is a standardized system for coding medical diagnoses and procedures. It contains a description of all known diseases and injuries across languages and allows quick access to disease information, including symptoms, treatment, and mortality rates.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_small"></div><div class="w-image align_none"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/fig05_slide6.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig05_slide6.jpg 1280w, https://www.megaputer.com/wp-content/uploads/fig05_slide6-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig05_slide6-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig05_slide6-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig05_slide6-600x338.jpg 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div></div></div></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The ICD-10 system has been adopted by numerous countries and is used in a wide array of healthcare settings, including hospitals, clinics, and other healthcare facilities. Fully implemented in 2015, ICD-10 contains over 150,000 codes covering a broad spectrum of medical conditions and treatments. The ICD ontology continues to evolve: the next version, ICD-11, is already available for implementation, but it is not expected to be fully adopted until 2025 or later.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>ICD-10 codes have 3 to 7 characters. The first 3 characters identify the diagnosis category, and the following characters encode the related etiology, anatomic site, severity, or other details. The seventh character (if used) is an extension classifier describing the episode of care. For example, if a medical record stated that “the patient developed a chalazion of the left upper eyelid,” this would be coded as H00.14 in the ICD-10 coding system.</p>

		</div>
	</div>
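The character layout described above can be sketched as a small parser. This is a deliberate simplification of real ICD-10-CM coding rules, and the 7-character code used below is only an illustrative pattern.

```python
# A simplified sketch of splitting an ICD-10-CM code into the parts
# described above: 3-character category, optional detail characters,
# and an optional 7th-character extension. Real coding rules vary by
# chapter and are more involved than this.
def parse_icd10(code):
    code = code.replace(".", "")        # the dot is only a visual separator
    parts = {"category": code[:3]}      # diagnosis category
    if len(code) > 3:
        parts["detail"] = code[3:6]     # etiology, site, severity, etc.
    if len(code) == 7:
        parts["extension"] = code[6]    # episode-of-care classifier
    return parts

print(parse_icd10("H00.14"))    # {'category': 'H00', 'detail': '14'}
print(parse_icd10("S52.521A"))  # includes a 7th-character extension 'A'
```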
<div class="w-image align_center"><div class="w-image-h"><img width="600" height="338" src="https://www.megaputer.com/wp-content/uploads/fig06_slide7-600x338.jpg" class="attachment-shop_single size-shop_single" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig06_slide7-600x338.jpg 600w, https://www.megaputer.com/wp-content/uploads/fig06_slide7-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig06_slide7-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig06_slide7-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig06_slide7.jpg 1280w" sizes="(max-width: 600px) 100vw, 600px" /></div></div><div class="w-separator size_medium"></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>MedDRA</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The <a href="https://www.meddra.org/" target="_blank" rel="noopener">Medical Dictionary for Regulatory Activities</a> (MedDRA) is an ontology for classifying and sharing regulatory information. It is typically used by pharmaceutical and biotechnology companies for FDA pharmacovigilance reporting on drugs and therapies. MedDRA supports the registration, documentation, and safety monitoring of medical products both before and after a product has been authorized for sale, including pharmaceuticals, vaccines, and drug-device combination products.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_small"></div><div class="w-image align_none"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/fig07_slide4.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig07_slide4.jpg 1280w, https://www.megaputer.com/wp-content/uploads/fig07_slide4-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig07_slide4-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig07_slide4-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig07_slide4-600x338.jpg 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div></div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>MedDRA contains over 70,000 standard terms for various signs, symptoms, diseases, diagnoses, and indications, and is updated twice a year. It consists of a 5-level taxonomy, arranging symptoms and diseases from very specific to very general. For example, consider the condition known as “chest pain” (stenocardia). The 5-level hierarchical taxonomy describing this is summarized by the following animation.</p>

		</div>
	</div>
<div class="w-image align_center"><div class="w-image-h"><img width="1920" height="1080" src="https://www.megaputer.com/wp-content/uploads/medontology_meddra_1.gif" class="attachment-full size-full" alt="" /></div></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>In the United States, the use of MedDRA is mandated by the FDA, which requires it in the submission of safety reports and other regulatory documents related to drug development and approval. It is widely recognized as a key standard for the classification and analysis of medical data in the pharmaceutical industry.</p>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-8 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>RxNorm</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p><a href="https://www.nlm.nih.gov/research/umls/rxnorm/index.html">RxNorm</a> is a comprehensive database containing over 100,000 medications and chemicals. It serves as a standardized nomenclature system that establishes connections between brand names and generic synonyms for clinical drugs. This mapping enables medical care providers to effectively manage the diverse vocabulary associated with drug and pharmaceutical products across various databases.</p>

		</div>
	</div>
</div></div></div><div class="vc_col-sm-4 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_small"></div><div class="w-image align_none"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/fig09_slide5.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig09_slide5.jpg 1280w, https://www.megaputer.com/wp-content/uploads/fig09_slide5-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig09_slide5-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig09_slide5-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig09_slide5-600x338.jpg 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div></div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>RxNorm&#8217;s primary aim is to facilitate efficient and accurate communication of drug-related information among computer systems. It is maintained and updated on a weekly basis by the National Library of Medicine (NLM) and encompasses all medications for humans available on the US market.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>RxNorm is a good example of a highly intricate ontology, as demonstrated in the example below for Zyrtec® (cetirizine hydrochloride), an over-the-counter antihistamine. The mapping not only provides the standard drug names but also highlights the relationships between different drug entities.</p>

		</div>
	</div>
<div class="w-image align_center"><div class="w-image-h"><img width="1280" height="720" src="https://www.megaputer.com/wp-content/uploads/fig10_slide10.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig10_slide10.jpg 1280w, https://www.megaputer.com/wp-content/uploads/fig10_slide10-300x169.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig10_slide10-1024x576.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig10_slide10-768x432.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig10_slide10-600x338.jpg 600w" sizes="(max-width: 1280px) 100vw, 1280px" /></div></div><div class="ult-spacer spacer-69d434ecdd1cf" data-id="69d434ecdd1cf" data-height="80" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3>Other sources for ontologies</h3>
<p>A few additional sources of coding systems and ontologies include:</p>
<ul>
<li><a href="https://www.cms.gov/medicare/fraud-and-abuse/physicianselfreferral/list_of_codes" target="_blank" rel="noopener">Current Procedural Terminology (CPT) codes</a>: an ontology used to report procedures and diagnostic services to healthcare providers, insurance companies, and accredited institutions.</li>
<li><a href="https://www.nlm.nih.gov/healthit/snomedct/index.html" target="_blank" rel="noopener">Systemized Nomenclature of Medicine – Clinical Terms (SNOMED-CT)</a>: an ontology used mainly by healthcare providers to share information in a more standardized manner.</li>
<li><a href="https://www.cms.gov/medicare/medicare-fee-for-service-payment/acuteinpatientpps/ms-drg-classifications-and-software" target="_blank" rel="noopener">Diagnosis-Related Group (DRG) codes</a>: a system used to classify hospital cases into approximately 500 groups for the purpose of payment.</li>
<li><a href="http://geneontology.org/" target="_blank" rel="noopener">Gene Ontology Consortium</a>: a set of frameworks for modeling complex biological systems and concepts.</li>
</ul>

		</div>
	</div>
</div></div></div></div><div class="g-cols wpb_row type_default valign_top vc_inner "><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2>How ontologies are used in PolyAnalyst™</h2>
<p>PolyAnalyst™ utilizes ontologies to search for and extract pertinent information. Within PolyAnalyst™, ontologies are imported as distinct &#8220;semantic dictionaries,&#8221; which are then used by PDL (Pattern Definition Language) to extract text patterns that correspond to the medical entities and relationships defined within those dictionaries. In addition to the existing ontologies, users can create their own.</p>
<p>PolyAnalyst™ can import and work with virtually any kind of ontology. Some ontologies can be used free of charge, while others require the user to obtain a license from the ontology owner. Currently, the following ontologies are provided with PolyAnalyst™ out of the box:</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<ul>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728220/" target="_blank" rel="noopener">Human Disease Ontology</a>: provides a framework for organizing and integrating information related to diseases in humans.</li>
<li><a href="https://hpo.jax.org/app/" target="_blank" rel="noopener">Human Phenotype Ontology</a>: provides a standardized vocabulary of phenotypic abnormalities encountered in human disease.</li>
<li><a href="https://www.nlm.nih.gov/mesh/meshhome.html" target="_blank" rel="noopener">MeSH</a>: covers a wide range of topics related to health and medicine.</li>
<li><a href="https://www.nlm.nih.gov/research/umls/rxnorm/index.html" target="_blank" rel="noopener">RXNorm</a>: contains information on medication names and their active ingredients.</li>
<li><a href="https://wordnet.princeton.edu/" target="_blank" rel="noopener">WordNet</a>: a broad semantic ontology that attempts to build a lexical database of English nouns, verbs, adjectives, and the relationships between them.</li>
</ul>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Other ontologies can be imported by the user (after acquiring the necessary license, where applicable).</p>
<p>The Dictionary Manager is the most direct way to access, view, and edit any imported ontology in PolyAnalyst™. For example, you can import the MeSH dictionary and search for terms such as “CVA” to view all the associated information.</p>

		</div>
	</div>
<div class="ult-spacer spacer-69d434ecde2b6" data-id="69d434ecde2b6" data-height="80" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div><div class="w-image align_center"><div class="w-image-h"><img width="1920" height="1080" src="https://www.megaputer.com/wp-content/uploads/fig11_gif_medontology_pa_meshdict.gif" class="attachment-full size-full" alt="" /></div></div><div class="ult-spacer spacer-69d434ecdeed8" data-id="69d434ecdeed8" data-height="80" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>These ontologies can also be accessed within the project nodes. Some key nodes that utilize these types of ontologies include Entity Extraction and Taxonomy. Within these nodes, users have the option to select which ontology to import. Subsequently, they can craft a query to identify semantically related terms, such as synonyms, hyponyms, or hypernyms, or any other semantic relations, as defined within that particular ontology.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>The example below illustrates how a query searching for medical conditions within the &#8220;influenza and pneumonia&#8221; category, as per ICD-10, can be simplified by leveraging the PDL functions integrated into PolyAnalyst™ in conjunction with the MeSH ontology. By employing the &#8220;hyponym&#8221; function, as shown below, it becomes possible to capture all the subtypes of influenza and pneumonia without the need to list them individually.</p>

		</div>
	</div>
<div class="ult-spacer spacer-69d434ecdf355" data-id="69d434ecdf355" data-height="80" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div><div class="w-image align_center"><div class="w-image-h"><img width="1374" height="436" src="https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3.jpg" class="attachment-full size-full" alt="" srcset="https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3.jpg 1374w, https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3-300x95.jpg 300w, https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3-1024x325.jpg 1024w, https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3-768x244.jpg 768w, https://www.megaputer.com/wp-content/uploads/fig12_mesh-in-pa3-600x190.jpg 600w" sizes="(max-width: 1374px) 100vw, 1374px" /></div></div><div class="ult-spacer spacer-69d434ece00b3" data-id="69d434ece00b3" data-height="80" data-height-mobile="10" data-height-tab="10" data-height-tab-portrait="10" data-height-mobile-landscape="10" style="clear:both;display:block;"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Lastly, one of the challenges of using ontologies is ontology cleaning: removing or correcting errors, inconsistencies, and irrelevant or outdated information in the ontology. This is particularly important in specialized domains, such as chemistry or toxicology, where the ontology may contain a large number of concepts and entities that are not relevant to the domain or may cause confusion or misinterpretation (e.g., “milk” or “coffee” being present within a chemical substance ontology). Thankfully, the task of modifying ontologies is quite straightforward within the Dictionaries module of PolyAnalyst™.</p>

		</div>
	</div>
<div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2>Ontologies and beyond</h2>
<p>Ontologies are powerful tools in information science, with potential that far exceeds the medical domain of the topics discussed here. Additional resources on using dictionaries in PolyAnalyst™ and how to import and utilize ontologies and specialized dictionaries can be found in our support and training materials.</p>

		</div>
	</div>
</div></div></div></div></div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/medical-ontology-and-its-use-in-text-analysis/">Medical Ontology and Its Use in Text Analysis</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
		<item>
		<title>Mastering Language Models: A Deep Dive into Input Parameters</title>
		<link>https://www.megaputer.com/mastering-language-models-a-deep-dive-into-input-parameters/</link>
		<pubDate>Thu, 28 Sep 2023 21:57:45 +0000</pubDate>
		<dc:creator><![CDATA[Echo Lu]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Text Analytics]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Large Language Models]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">https://www.megaputer.com/?p=34673</guid>
		<description><![CDATA[<p>If you have ever used a language model through a playground or an API, you may have been asked to choose some input parameters, and for many of us their meaning (and the right way to use them) is less than totally clear. This article explains five essential input parameters (temperature, top-p, top-k, frequency penalty, and presence penalty) and shows how to use them to control hallucinations, inject creativity into a model’s outputs, and navigate the quality-diversity tradeoff. </p>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/mastering-language-models-a-deep-dive-into-input-parameters/">Mastering Language Models: A Deep Dive into Input Parameters</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></description>
				<content:encoded><![CDATA[<section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>If you have ever used a language model through a playground or an API, you may have been asked to choose some input parameters. For many of us, the meaning of these parameters (and the right way to use them) is less than totally clear.</p>

		</div>
	</div>
<div class="w-separator size_small"></div><div class="w-image align_center"><div class="w-image-h"><img width="400" height="240" src="https://www.megaputer.com/wp-content/uploads/parameter-selection_in_the_sillytavern_interface.png" class="attachment-large size-large" alt="A screenshot showing parameter selection in the SillyTavern interface. Image by the author." srcset="https://www.megaputer.com/wp-content/uploads/parameter-selection_in_the_sillytavern_interface.png 400w, https://www.megaputer.com/wp-content/uploads/parameter-selection_in_the_sillytavern_interface-300x180.png 300w" sizes="(max-width: 400px) 100vw, 400px" /></div></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>This article will teach you how to use these parameters to control hallucinations, inject creativity into your model’s outputs, and make other fine-grained adjustments to optimize behavior. Much like prompt engineering, input parameter tuning can get your model running at 110%.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>By the end of this article, you’ll be an expert on five essential input parameters — temperature, top-p, top-k, frequency penalty, and presence penalty. You’ll also learn how each of these parameters helps us navigate the quality-diversity tradeoff.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>So, grab a coffee, and let’s get started!</p>

		</div>
	</div>
</div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<div id="table-of-contents">
<h3>Table of Contents</h3>
<ul>
<li><a href="#section1">Background</a></li>
<li><a href="#section2">Quality, Diversity, and Temperature</a></li>
<li><a href="#section3">Top-k and Top-p</a></li>
<li><a href="#section4">Frequency and Presence Penalties</a></li>
<li><a href="#section5">The Parameter-Tuning Cheat Sheet</a></li>
<li><a href="#section6">Wrapping up</a></li>
</ul>
</div>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section1"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="5876" class="nb nc fo be nd ne nf go ng nh ni gr nj nk nl nm nn no np nq nr ns nt nu nv nw bj">Background</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Before we start, we will need to go over some background information about how these models work. Let’s start our deep dive by reviewing the fundamentals.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">To read a document, a language model breaks it down into a sequence of tokens. A token is just a small chunk of text that the model can easily understand: It could be a word, a syllable, or a character. For example, “Megaputer is great!” could be broken down into five tokens: [“Mega”, “puter”, “ is”, “ great”, “!”]. This splitting is done by the model’s tokenizer.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Most language models we are familiar with operate by repeatedly generating the next token in a sequence. Each time the model wants to generate another token, it re-reads the entire sequence and then predicts the token that should come next. This strategy is known as <em class="oc">autoregressive</em> generation.</p>

		</div>
	</div>
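The autoregressive loop described above can be sketched in a few lines of Python. The toy `next_token` function below is purely illustrative and stands in for a real language model, which would score every token in its vocabulary:

```python
# A minimal sketch of autoregressive generation. The toy next_token
# function stands in for a real language model.
def next_token(tokens):
    # Hypothetical "model": it always continues a fixed example sentence.
    canned = ["Mega", "puter", " is", " great", "!"]
    return canned[len(tokens)] if len(tokens) < len(canned) else None

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while True:
        # Re-read the entire sequence and predict the token that comes next.
        tok = next_token(tokens)
        if tok is None:  # the model signals the end of the sequence
            break
        tokens.append(tok)
    return "".join(tokens)

print(generate([]))  # -> "Megaputer is great!"
```

Note that each new token requires another full pass over the sequence, which is why generation is printed out one token at a time.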
<div class="w-image align_center meta_simple"><a class="w-image-h" href="https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/" title="How to Get Better Outputs from Your Large Language Model" target="_blank"><img width="320" height="320" src="https://www.megaputer.com/wp-content/uploads/autoregressive_generation.gif" class="attachment-large size-large" alt="GIF by Echo Lu, containing a modification of an image by Annie Surla from NVIDIA. Modified with permission from owner." /><div class="w-image-meta"><div class="w-image-title">Autoregressive generation of tokens by a language model.</div><div class="w-image-description">Contains a modification of an image by Annie Surla from NVIDIA.</div></div></a></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">This explains why ChatGPT prints its answer one word at a time: it is streaming each token to you as it is generated.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">To generate a token, a language model assigns a likelihood score to each token in its vocabulary. A token gets a high likelihood score if it is a good continuation of the text and a low likelihood score if it is a poor continuation of the text, as assessed by the model. The sum of the likelihood scores over all tokens in the model’s vocabulary is always exactly equal to one: the scores form a probability distribution over the vocabulary.</p>

		</div>
	</div>
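The reason the scores always sum to one is that they are typically produced by applying the softmax function to the model’s raw outputs (its logits). A minimal sketch, using made-up logits for four candidate tokens:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for four candidate tokens.
logits = [4.0, 2.5, 1.0, -1.0]
scores = softmax(logits)
print(sum(scores))  # sums to 1.0 (up to floating-point error)
```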
<div class="w-image align_center meta_simple"><a class="w-image-h" href="https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/" title="How to Get Better Outputs from Your Large Language Model" target="_blank"><img width="1024" height="260" src="https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process-1024x260.png" class="attachment-large size-large" alt="A language model assigns likelihood scores to predict the next token in the sequence. Original image by Annie Surla from NVIDIA, modified by Echo Lu with permission from owner." srcset="https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process-1024x260.png 1024w, https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process-300x76.png 300w, https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process-768x195.png 768w, https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process-600x152.png 600w, https://www.megaputer.com/wp-content/uploads/llm_token_sampling_process.png 1095w" sizes="(max-width: 1024px) 100vw, 1024px" /><div class="w-image-meta"><div class="w-image-title">A language model assigns likelihood scores to predict the next token in the sequence.</div><div class="w-image-description">Original image by Annie Surla from NVIDIA, modified with permission from owner.</div></div></a></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">After the likelihood scores are assigned, a token sampling scheme is used to pick the next token in the sequence. The token sampling scheme may incorporate some randomness so that the language model does not answer the same question in the same way every time. This randomness can be a nice feature in chatbots, as well as in some other applications.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="59cd" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em class="oc">TLDR: Language models break down text into chunks, predict the next chunk in the sequence, and mix in some randomness. Repeat as needed to generate output.</em></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section2"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="1164" class="oc nc ev be nd od oe fv nh of og fy nl oh oi oj ok ol om on oo op oq or os ot bj">Quality, Diversity, and Temperature</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>But why would we ever want to pick the second-best token, the third-best token, or any other token besides the best, for that matter? Wouldn’t we want to pick the best continuation (the one with the highest likelihood score) every time? Often, we do. But if we picked the best answer every time, we would get the same answer every time. If we want a diverse range of answers, we may have to give up some quality to get it. This sacrifice of quality for diversity is called the quality-diversity tradeoff.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>With this in mind, <em class="oc">temperature</em> tells the machine how to navigate the quality-diversity tradeoff. Low temperatures mean more quality, while high temperatures mean more diversity. When the temperature is set to zero, the model always samples the token with the highest likelihood score, resulting in zero diversity between queries, but ensuring that we always pick the highest quality continuation as assessed by the model.</p>

		</div>
	</div>
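In the standard implementation, temperature works by dividing the logits by T before the softmax: as T approaches zero, all probability concentrates on the best token, while larger T flattens the distribution. A sketch under that standard assumption:

```python
import math

def sample_probs(logits, temperature):
    if temperature == 0:
        # Greedy decoding: all probability mass goes to the best token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    # Scale logits by 1/T, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.5, 1.0]
print(sample_probs(logits, 0))    # [1.0, 0.0, 0.0] -- deterministic
print(sample_probs(logits, 2.0))  # a flatter, more diverse distribution
```

At temperature 2.0 the top token keeps less of the probability mass than at temperature 1.0, which is exactly the quality-for-diversity trade described above.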

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p>Most commonly, we will want to set the temperature to zero for text analytics tasks. This is because, in text analytics, there is often a single “right” answer that we want to get every time. At temperature zero, we have the best chance of arriving at this answer in one shot. I like to set the temperature to zero for entity extraction, fact extraction, sentiment analysis, and most other things that I get up to in my job as an analyst. As a rule, you should always choose temperature zero for any prompt that you will only pass to the model once, as this is most likely to get you a good answer.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a120" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">At higher temperatures, we see more garbage and hallucinations, less coherence, and lower quality of responses on average, but also more creative and diverse responses. We use temperatures higher than zero when we want to pass the same prompt to the model many times and get many creative responses. We recommend using non-zero temperatures <em class="oc">only </em>when you want to ask the same question twice and get two different answers.</p>

		</div>
	</div>
<div class="w-separator size_small"></div><div class="w-image align_center meta_simple"><div class="w-image-h"><img width="423" height="355" src="https://www.megaputer.com/wp-content/uploads/quality-diversity_tradeoff.png" class="attachment-large size-large" alt="Higher temperatures bring diversity, creativity, and multiplicity of answers but also add garbage, incoherence, and hallucinations. Image by Echo Lu." srcset="https://www.megaputer.com/wp-content/uploads/quality-diversity_tradeoff.png 423w, https://www.megaputer.com/wp-content/uploads/quality-diversity_tradeoff-300x252.png 300w" sizes="(max-width: 423px) 100vw, 423px" /><div class="w-image-meta"><div class="w-image-title">Quality Diversity Tradeoff</div><div class="w-image-description">Temperature adds diversity, creativity, and multiplicity of answers but also garbage, incoherence, and hallucinations.</div></div></div></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a120" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">And why would we ever want two different answers to the same prompt? In some cases, having many answers to one prompt can be very useful. For example, there is a technique in which we generate many answers to a prompt and keep only the best one, which often produces better answers than a single query at temperature zero. Another use case is synthetic data generation: We want many synthetic data points, not just one. We may discuss these use cases (and others) in later articles, but more often than not, we need only one answer. <strong class="lr fp">When in doubt, choose temperature zero!</strong></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a120" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">It is important to note that while temperature zero should <em class="oc">in theory</em> produce the same answer every time, this may not be true in practice! This is because the GPUs the model runs on are prone to tiny numerical imprecisions, such as floating-point rounding errors. These imprecisions introduce a low level of randomness into the calculations, even at temperature zero. Since changing one token in a text can significantly alter its meaning, a single discrepancy may cause a cascade of different token choices later in the text, resulting in an almost totally different output. But rest assured that this usually has a negligible impact on quality. We only mention it so that you’re not surprised when you get some randomness at temperature zero.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a120" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">There are more ways to navigate the quality-diversity tradeoff than temperature alone. In the next section, we will discuss some modifications to the temperature sampling technique. But if you are content with using temperature zero, feel free to skip it for now. You may rest soundly knowing that your choice of these parameters at temperature zero will not affect your answer.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a120" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em class="oc">TLDR: Temperature increases diversity but decreases quality by adding randomness to the model’s outputs.</em></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section3"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="a476" class="nb nc fo be nd ne nf go ng nh ni gr nj nk nl nm nn no np nq nr ns nt nu nv nw bj">Top-k and Top-p Sampling</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="6dd4" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">One common way to tweak our token-sampling formula is called top-k sampling. Top-k sampling is a lot like ordinary temperature sampling, except that the lowest likelihood tokens are excluded from being picked: Only the “top k” best choices are considered, which is where we get the name. The advantage of this method is that it stops us from picking truly bad tokens.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Suppose, for example, that we are generating a completion for “The sun rises in the…” Without top-k sampling, the model considers every token in its vocabulary as a possible continuation of the sequence, so there is some non-zero chance that it will write something ridiculous like “The sun rises in the refrigerator.” With top-k sampling, the model filters out these truly bad picks and only considers the k best options. By clipping off the long tail, we lose a little diversity, but our quality shoots way up.</p>

		</div>
	</div>
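The top-k filter itself is simple: keep the k highest-scoring candidates, discard the rest, and renormalize before sampling. A sketch with illustrative token scores:

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    cutoff = sorted(probs.values(), reverse=True)[k - 1]
    kept = {tok: p for tok, p in probs.items() if p >= cutoff}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Illustrative scores for "The sun rises in the ..."
probs = {"east": 0.80, "morning": 0.15, "west": 0.04, "refrigerator": 0.01}
print(top_k_filter(probs, 2))  # only "east" and "morning" survive
```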
<div class="w-separator size_small"></div><div class="w-image align_center meta_simple"><div class="w-image-h"><img width="565" height="453" src="https://www.megaputer.com/wp-content/uploads/unreliable-tail.png" class="attachment-large size-large" alt="Top-k sampling improves quality by keeping only the k best candidate tokens and throwing out the rest. Image by Echo Lu." srcset="https://www.megaputer.com/wp-content/uploads/unreliable-tail.png 565w, https://www.megaputer.com/wp-content/uploads/unreliable-tail-300x241.png 300w, https://www.megaputer.com/wp-content/uploads/unreliable-tail-499x400.png 499w" sizes="(max-width: 565px) 100vw, 565px" /><div class="w-image-meta"><div class="w-image-title">Top-k sampling</div><div class="w-image-description">Top-k sampling improves quality by keeping only the k best candidate tokens and throwing out the rest. </div></div></div></div><div class="w-separator size_medium"></div>
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Top-k sampling is a way to have your cake and eat it too: It gets you the diversity you need at a smaller cost to quality than with temperature alone. Since this technique is so wildly effective, it has inspired many variants.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">One common variant of top-k sampling is called top-p sampling, which is also known as nucleus sampling. Top-p sampling is a lot like top-k, except that it uses likelihood scores instead of token ranks to determine where it clips the tail. More specifically, it only considers those top-ranked tokens whose combined likelihood exceeds the threshold p, throwing out the rest.</p>

		</div>
	</div>
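The top-p filter can be sketched similarly: walk down the ranked tokens, accumulating probability until the threshold p is crossed, then discard everything after that point and renormalize. Token names here are again illustrative:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of top-ranked tokens whose combined
    # likelihood reaches the threshold p, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}

probs = {"east": 0.80, "morning": 0.15, "west": 0.04, "refrigerator": 0.01}
# "east" and "morning" already cover 95% of the mass, so p=0.90 keeps only them.
print(top_p_filter(probs, 0.90))
```

Notice that the number of surviving tokens now depends on the shape of the distribution, not a fixed count k, which is exactly the flexibility described below.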

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">The power of top-p sampling compared to top-k sampling becomes evident when there are many poor or mediocre continuations. Suppose, for example, that there are only a handful of good picks for the next token, and there are dozens that just vaguely make sense. If we were using top-k sampling with k=25, we would be considering many poor continuations. In contrast, if we used top-p sampling to filter out the bottom 10% of the probability distribution, we might only consider those good tokens while filtering out the rest.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">In practice, top-p sampling tends to give better results compared to top-k sampling. By focusing on the cumulative likelihood, it adapts to the context of the input and provides a more flexible cut-off. So, in conclusion, top-p and top-k sampling can both be used at non-zero temperatures to capture diversity at a lower quality cost, but top-p sampling usually does it better.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Tip: For both of these settings, a lower value means more filtering. At their minimum settings (k = 1, or p close to zero), they filter out all but the top-ranked token, which has the same effect as setting the temperature to zero. So feel free to use these parameters, but be aware that setting them too low will give up all of your diversity.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="8e2d" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em class="oc">TLDR: Top-k and top-p increase quality at only a small cost to diversity. They achieve this by removing the worst token choices before random sampling.</em></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section4"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="e517" class="nb nc fo be nd ne nf go ng nh ni gr nj nk nl nm nn no np nq nr ns nt nu nv nw bj">Frequency and Presence Penalties</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="2cf9" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">We have just two more parameters to discuss before we start to wrap things up: the frequency and presence penalties. These parameters are, big surprise, yet another way to navigate the quality-diversity tradeoff. But while the temperature parameter achieves diversity by adding randomness to the token sampling procedure, the frequency and presence penalties add diversity by penalizing the reuse of tokens that have already occurred in the text. This makes the sampling of old and overused tokens less likely, influencing the model to make more novel token choices.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">The frequency penalty adds a penalty to a token for each time it has occurred in the text. This discourages repeated use of the same tokens/words/phrases and also has the side effect of causing the model to discuss more diverse subject matter and change topics more often. On the other hand, the presence penalty is a flat penalty that is applied if a token has already occurred in the text. This causes the model to introduce more new tokens/words/phrases, which causes it to discuss more diverse subject matter and change topics more often without significantly discouraging the repetition of frequently used words.</p>

		</div>
	</div>
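The two penalties can be sketched together. The formula below follows the one published in OpenAI's API documentation: each token's score is reduced by its occurrence count times the frequency penalty, plus a flat presence penalty if it has appeared at all. The token scores and history are illustrative:

```python
from collections import Counter

def apply_penalties(logits, generated, freq_penalty, pres_penalty):
    # Lower the score of every token that has already appeared:
    # the frequency penalty scales with how often it appeared,
    # the presence penalty is a flat, one-time deduction.
    counts = Counter(generated)
    return {
        tok: score
        - counts[tok] * freq_penalty
        - (1.0 if counts[tok] > 0 else 0.0) * pres_penalty
        for tok, score in logits.items()
    }

logits = {" cat": 2.0, " dog": 1.5, " fish": 1.0}
history = [" cat", " cat", " dog"]
print(apply_penalties(logits, history, freq_penalty=0.5, pres_penalty=0.4))
# " cat" drops by 2*0.5 + 0.4 = 1.4; " dog" drops by 0.9; " fish" is untouched
```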

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Much like temperature, the frequency and presence penalties lead us away from the “best” possible answer and toward a more creative one. But instead of doing this with randomness, they add targeted penalties that are carefully calculated to inject diversity into the answer. On some of those rare tasks requiring a non-zero temperature (when you require many answers to the same prompt), you might also consider adding a small frequency or presence penalty to the mix to boost creativity. But for prompts having just one right answer that you want to find in just one try, your odds of success are highest when you set all of these parameters to zero.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">As a rule, when there is one right answer, and you are asking just one time, you should set the frequency and presence penalties to zero. But what if there are many right answers, such as in text summarization? In this case, you have a little discretion. If you find a model’s outputs boring, uncreative, repetitive, or limited in scope, judicious application of the frequency or presence penalties could be a good way to spice things up. But our final suggestion for these parameters is the same as for temperature: <strong class="lr fp">When in doubt, choose zero!</strong></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">We should note that while temperature and frequency/presence penalties both add diversity to the model’s responses, the kind of diversity that they add is not the same. The frequency/presence penalties increase the diversity <em class="oc">within a single response. </em>This means that a response will have more distinct words, phrases, topics, and subject matters than it would have without these penalties. But when you pass the same prompt twice, you are not more likely to get two different answers. This is in contrast with temperature, which increases diversity <em class="oc">between responses:</em> At higher temperatures, you will get a more diverse range of answers when passing the same prompt to the model many times.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">I like to refer to this distinction as within-response diversity vs. between-response diversity. The temperature parameter adds both within-response AND between-response diversity, while the frequency/presence penalties add only within-response diversity. So, when we need diversity, our choice of parameters should depend on the kind of diversity we need.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="d640" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em class="oc">TLDR: The frequency and presence penalties increase the diversity of subject matters discussed by a model and make it change topics more often. The frequency penalty also increases the diversity of word choice by reducing the repetition of words and phrases.</em></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section5"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="c9f4" class="nb nc fo be nd ne nf go ng nh ni gr nj nk nl nm nn no np nq nr ns nt nu nv nw bj">The Parameter-Tuning Cheat Sheet</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="328a" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">This section is intended as a practical guide for choosing your model’s input parameters. We first provide some hard-and-fast rules for deciding which values to set to zero. Then, we give some tips to help you find the right values for your non-zero parameters.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="328a" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">I strongly encourage you to use this cheat sheet when choosing your input parameters. Go ahead and bookmark this page now so you don’t lose it!</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3 id="9e0d" class="om nc fo be nd on oo op ng oq or os nj ly ot ou ov mc ow ox oy mg oz pa pb pc bj">Rules for setting parameters to zero:</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="88ed" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">Temperature:</p>
<ul>
<li>For a <strong class="lr fp">single answer</strong> per prompt: Zero.</li>
<li>For <strong class="lr fp">many answers</strong> per prompt: Non-zero.</li>
</ul>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="60cf" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Frequency and Presence Penalties:</p>
<ul>
<li>When there is<strong class="lr fp"> one correct answer:</strong> Zero.</li>
<li>When there are <strong class="lr fp">many correct answers:</strong> Optional.</li>
</ul>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="1d77" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Top-p/Top-k:</p>
<ul class="">
<li id="be8e" class="lp lq fo lr b gm ls lt lu gp lv lw lx ly pd ma mb mc pe me mf mg pf mi mj mk pg ph pi bj" data-selectable-paragraph="">With <strong class="lr fp">zero temperature</strong>: The output is not affected.</li>
<li id="9713" class="lp lq fo lr b gm pj lt lu gp pk lw lx ly pl ma mb mc pm me mf mg pn mi mj mk pg ph pi bj" data-selectable-paragraph="">With <strong class="lr fp">non-zero temperature</strong>: Non-zero.</li>
</ul>

		</div>
	</div>
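As a concrete illustration of the rules above, here is how the request parameters might look for a one-shot text-analytics prompt. The parameter names follow the OpenAI-style chat completions API; the model name and prompt are made up for the example, so adjust both to your provider:

```python
# Hypothetical request payload for a single-answer extraction prompt.
# Parameter names follow the OpenAI-style chat completions API.
request = {
    "model": "gpt-4",  # assumed model name; substitute your own
    "messages": [{"role": "user",
                  "content": "Extract all company names from the text."}],
    "temperature": 0,        # single answer per prompt -> zero
    "top_p": 1,              # has no effect at temperature zero
    "frequency_penalty": 0,  # one correct answer -> zero
    "presence_penalty": 0,   # one correct answer -> zero
}
print(request["temperature"])
```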

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="1d77" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">If your language model has additional parameters not listed here, leave them at their default values.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h3 id="32f8" class="om nc fo be nd on oo op ng oq or os nj ly ot ou ov mc ow ox oy mg oz pa pb pc bj">Tips for tuning the non-zero parameters:</h3>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="1d77" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Make a list of those parameters that should have non-zero values, and then go to a playground and fiddle around with some test prompts to see what works. <em>But if the rules above say to leave a parameter at zero, leave it at zero!</em></p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="1d77" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Tuning temperature/top-p/top-k:</p>
<ol class="">
<li id="1ea3" class="lp lq ev lr b ft ls lt lu fw lv lw lx ly pd ma mb mc pe me mf mg pf mi mj mk po ph pi bj" data-selectable-paragraph="">For more diversity/randomness, increase the temperature.</li>
<li id="641b" class="lp lq ev lr b ft pj lt lu fw pk lw lx ly pl ma mb mc pm me mf mg pn mi mj mk po ph pi bj" data-selectable-paragraph="">With non-zero temperatures, start with a top-p around 0.95 (or top-k around 250) and lower it as needed.</li>
</ol>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="950f" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Troubleshooting:</p>
<ol class="">
<li id="f1ed" class="lp lq ev lr b ft ls lt lu fw lv lw lx ly pd ma mb mc pe me mf mg pf mi mj mk po ph pi bj" data-selectable-paragraph="">If there is too much nonsense, garbage, or hallucination, decrease temperature and/or decrease top-p/top-k.</li>
<li id="77ec" class="lp lq ev lr b ft pj lt lu fw pk lw lx ly pl ma mb mc pm me mf mg pn mi mj mk po ph pi bj" data-selectable-paragraph="">If the temperature is high and diversity is low, increase top-p/top-k.</li>
</ol>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="950f" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Tip: While some interfaces allow you to use top-p and top-k at the same time, we prefer to keep things simple by choosing one or the other. Top-k is easier to use and understand, but top-p is often more effective.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="caaa" class="pw-post-body-paragraph lp lq ev lr b ft ls lt lu fw lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk eo bj" data-selectable-paragraph="">Tuning frequency penalty and presence penalty:</p>
<ol class="">
<li id="f568" class="lp lq ev lr b ft ls lt lu fw lv lw lx ly pd ma mb mc pe me mf mg pf mi mj mk po ph pi bj" data-selectable-paragraph="">For more diverse topics and subject matters, increase the presence penalty.</li>
<li id="db7a" class="lp lq ev lr b ft pj lt lu fw pk lw lx ly pl ma mb mc pm me mf mg pn mi mj mk po ph pi bj" data-selectable-paragraph="">For more diverse and less repetitive language, increase the frequency penalty.</li>
</ol>
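The mechanics behind these two penalties can be sketched in a few lines. One common formulation (the one described in OpenAI's API documentation) subtracts from each already-seen token's logit: the frequency penalty scales with how many times the token has appeared, while the presence penalty is a flat deduction for having appeared at all. The function and variable names below are ours:

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Penalize logits of tokens that already appear in the generated text.

    frequency_penalty scales with HOW OFTEN a token has appeared;
    presence_penalty is a one-time hit for having appeared AT ALL.
    """
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for tok, n in counts.items():
        if tok in adjusted:
            adjusted[tok] -= n * frequency_penalty + presence_penalty
    return adjusted

logits = {"the": 3.0, "a": 2.0, "zebra": 0.5}
history = ["the", "the", "a"]
print(apply_penalties(logits, history, frequency_penalty=1.0, presence_penalty=0.5))
# "the" appeared twice, so its logit drops by 2*1.0 + 0.5, from 3.0 to 0.5
```

Unseen tokens ("zebra" here) are untouched, which is exactly why raising these penalties pushes the model toward fresh words and fresh topics.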

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="950f" class="pw-post-body-paragraph lp lq ev lr b ft ls lt lu fw lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk eo bj" data-selectable-paragraph="">Troubleshooting:</p>
<ol class="">
<li id="562f" class="lp lq ev lr b ft ls lt lu fw lv lw lx ly pd ma mb mc pe me mf mg pf mi mj mk po ph pi bj" data-selectable-paragraph="">If the outputs seem scattered and change topics too quickly, decrease the presence penalty.</li>
<li id="108e" class="lp lq ev lr b ft pj lt lu fw pk lw lx ly pl ma mb mc pm me mf mg pn mi mj mk po ph pi bj" data-selectable-paragraph="">If there are too many new and unusual words, or if the presence penalty is set to zero and you still get too many topic changes, decrease the frequency penalty.</li>
</ol>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="950f" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em>TLDR: You can use this section as a cheat sheet for tuning language models. You are <strong>definitely</strong> going to forget these rules, so bookmark this page and use it later as a reference.</em></p>

		</div>
	</div>
<div class="g-cols wpb_row type_default valign_top vc_inner " id="section6"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper"></div></div></div></div></div></div></div></div></div></section><section class="l-section wpb_row height_small"><div class="l-section-h i-cf"><div class="g-cols vc_row type_default valign_top"><div class="vc_col-sm-12 wpb_column vc_column_container"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<h2 id="7610" class="nb nc fo be nd ne nf go ng nh ni gr nj nk nl nm nn no np nq nr ns nt nu nv nw bj">Wrapping up</h2>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="a507" class="pw-post-body-paragraph lp lq fo lr b gm nx lt lu gp ny lw lx ly nz ma mb mc oa me mf mg ob mi mj mk fh bj" data-selectable-paragraph="">There is no limit to the possible token-sampling strategies out there. Other notable strategies include beam search and adaptive sampling. However, the ones we’ve discussed here — temperature, top-k, top-p, frequency penalty, and presence penalty — are the most commonly used parameters. These are the parameters that you can expect to find in models like Claude, Llama, and the GPT series. In this article, we have shown that all of these parameters are really just here to help us navigate the quality-diversity tradeoff.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="e36b" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">Before we go, there is one last input parameter to mention: maximum token length. The maximum token length is simply the cutoff at which the model stops generating its answer, even if it has not finished responding. After this complex discussion, we hope this one is self-explanatory.</p>
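As a final illustration, here is a toy generation loop (our own sketch, not any model's real decoder) showing how the cap works: generation ends either when the model emits a stop token or when the token budget runs out, whichever comes first:

```python
def generate(prompt_tokens, next_token_fn, max_tokens, stop_token="<eos>"):
    """Generate until the model emits a stop token OR max_tokens is reached.

    When the cap is hit first, the reply is simply cut off mid-thought.
    """
    out = []
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        tok = next_token_fn(tokens)
        if tok == stop_token:
            break
        out.append(tok)
        tokens.append(tok)
    return out

# A toy "model" that wants to say five words and then stop on its own.
reply = iter(["Hello", "there,", "nice", "to", "meet", "<eos>"])
print(generate([], lambda _: next(reply), max_tokens=3))  # ['Hello', 'there,', 'nice']
```

With max_tokens=3 the answer is truncated after three tokens, even though the "model" had more to say.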

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="e36b" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph="">As we move further in this series, we’ll do more deep dives into advanced topics. Future articles will discuss prompt engineering, choosing the right language model for your use case, and more! Stay tuned, and keep exploring the vast horizons of language models with Megaputer Intelligence.</p>

		</div>
	</div>

	<div class="wpb_text_column ">
		<div class="wpb_wrapper">
			<p id="e36b" class="pw-post-body-paragraph lp lq fo lr b gm ls lt lu gp lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fh bj" data-selectable-paragraph=""><em class="oc">TLDR: When in doubt, set the temperature, frequency penalty, and presence penalty to zero. If that doesn’t work for you, reference the cheat sheet above.</em></p>

		</div>
	</div>
</div></div></div></div></div></section>
<p>The post <a rel="nofollow" href="https://www.megaputer.com/mastering-language-models-a-deep-dive-into-input-parameters/">Mastering Language Models: A Deep Dive into Input Parameters</a> appeared first on <a rel="nofollow" href="https://www.megaputer.com">Megaputer Intelligence</a>.</p>
]]></content:encoded>
			</item>
	</channel>
</rss>
