Use the vitals package with ellmer to test and benchmark the accuracy of large language models (LLMs), including writing evals to test local models.
Credit: Phalexaviles/Shutterstock
Does your generative AI application reliably return the results you want? Could cheaper LLMs, or even free models you can run locally, handle some parts of your workflow?
Those questions can be hard to answer definitively. LLM capabilities seem to change monthly, and unlike conventional software code, models don’t necessarily return the same response when you ask the same question twice. That makes running and re-running tests both tedious and time-consuming.
Fortunately, there are frameworks that help automate LLM testing. These LLM “evaluations,” or “evals,” as they’re usually called, are somewhat like the unit tests used for conventional software. Unlike standard unit tests, though, evals must account for the fact that LLMs can give different answers to the same question, and that more than one response may be acceptable. That often means testing against flexible criteria instead of checking for an exact match to a single expected value.
The vitals package, based on Python’s Inspect framework, brings automated LLM evaluations to R. Vitals was designed to work with the ellmer R package, so you can use the two together to test prompts, AI applications, and how different LLMs affect both performance and cost. Among other things, it helped show that AI agents often ignore information in plots when that information contradicts their prior beliefs, according to package author Simon Couch, a senior software engineer at Posit. Couch said by email that the experiment, run with a suite of vitals evals dubbed bluffbench, “truly resonated with many individuals.”
Couch is also using the package to measure how well different LLMs generate R code.
Set up vitals
You can install the vitals package from CRAN or, for the development version, from GitHub with pak::pak("tidyverse/vitals"). As of this writing, some of the features used in this article’s examples, including a function for extracting structured data from text, require the dev version.
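For reference, the two installation routes look like this (pak is assumed to be installed already for the GitHub option):

install.packages("vitals")     # released version from CRAN
# install.packages("pak")      # if you don't already have pak
pak::pak("tidyverse/vitals")   # development version from GitHub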
Vitals uses a Task object to set up and run evaluations. Each task needs three main components: a dataset, a solver, and a scorer.
Dataset
A vitals dataset is a data frame with the information you want to test. It must include at least two columns:
input: The request you want to send to the LLM.
target: The desired response from the LLM.
The vitals package includes a sample dataset called are. That data frame has additional columns, such as id (always a good idea to include in your own data), but those extra columns are optional.
As Couch told posit::conf attendees a few months ago, one of the easiest ways to create your own input–target pairs for a dataset is to type what you want into a spreadsheet. Set up spreadsheet columns named “input” and “target,” fill them in, and then import that spreadsheet into R with a package such as googlesheets4 or rio (see the example after the screenshot below).
Example of a spreadsheet to create a vitals dataset with input and target columns.
Sharon Machlis
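For instance, importing a sheet like the one shown above could look like this (the sheet URL and file name below are placeholders, not real files):

# From Google Sheets (replace with your own sheet's URL or ID):
my_dataset <- googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/your-sheet-id")
# Or from a local Excel or CSV file:
my_dataset <- rio::import("my_eval_dataset.xlsx")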
Below is R code for three simple queries I’ll use to test vitals. The code creates an R data frame directly, so you can copy and paste it to follow along. This dataset asks an LLM to write R code for a bar chart, determine the sentiment of some text, and write a haiku.
my_dataset <- data.frame(
id = c("barchart", "sentiment-analysis", "haiku"),
input = c(
"Write R code using ggplot2 to create a bar chart with bars that are colored steel blue. Create your own sample data, with values larger than 1,000 so the axis labels will need commas. Remove all the grid lines and grey background from the graph, and make the background white. Put the category names on the Y axis and values on the X axis, so the bars are horizontal instead of vertical. Sort the bars by value in descending order with the largest value at the top. Value axis labels should include commas. Make your code efficient and elegant. ONLY return runnable R code, with any comments or explanations included as R comments.",
"Determine whether the following review text has a sentiment of Positive, Negative, or Mixed: This desktop computer has a better processor and can handle much more demanding tasks such as running LLMs locally. However, it\U{2019}s also noisy and comes with a lot of bloatware.",
"Write me a haiku about winter"
),
target = c(
'Example solution: library(ggplot2)\r\nlibrary(scales)\r\nsample_data <- data.frame(\r\n category = c("Laptop", "Smartphone", "Tablet", "Headphones", "Monitor"),\r\n revenue = c(45000, 62000, 28000, 15000, 33000)\r\n)\r\nggplot(sample_data, aes(x = revenue, y = reorder(category, revenue))) +\r\n # Add steel blue bars\r\n geom_col(fill = "steelblue") +\r\n theme_classic() +\r\n scale_x_continuous(labels = label_comma()) +\r\n theme(\r\n panel.grid = element_blank(),\r\n axis.line = element_line(color = "black")\r\n ) +\r\n labs(\r\n x = "Value",\r\n y = "Category",\r\n title = "Product Revenue - Sorted by Performance"\r\n )',
"Mixed",
"Response must be a 3-line poem with 5 syllables in the 1st line, 7 in the 2nd line, and 5 in the 3rd line. It must include some sort of theme about winter."
)
)
Next, I’ll load the libraries I need and set a logging directory for when I run evaluations, since the package will ask for this as soon as it’s loaded:
library(vitals)
library(ellmer)
vitals_log_dir_set("./logs")
Here’s the start of setting up a new Task with the dataset, although this code will throw an error without the solver and scorer, the other two required arguments.
my_task <- Task$new(
dataset = my_dataset
)
If you want a pre-built example instead, you can use dataset = are with its seven R tasks.
Coming up with good sample targets can take some work. The classification example was easy, since it calls for a one-word answer, “Mixed.” Other queries, such as code generation or text summarization, can produce more free-form answers. Don’t rush this step: if you want your automated “judge” to grade accurately, you need to think carefully about what counts as an acceptable response.
Solver
The task’s second component, the solver, is the R code that sends your queries to an LLM. For simple queries, you can usually wrap an ellmer chat object in vitals’ generate() function. If your input is more complex and requires tool calling, you may need a custom solver. For this part of the demo, I’ll use a basic solver with generate(). Later, we’ll add a second solver with generate_structured().
It helps to know the ellmer R package when working with vitals. Here’s an example of using ellmer on its own, without vitals, with my_dataset$input[1], the first query in my dataset data frame, as the prompt. This code returns a response to the question but doesn’t evaluate it.
Note: You’ll need an OpenAI key if you want to run this code as written, or you can change the model (and API key) to any other LLM that ellmer supports. Make sure any API keys you need for other providers are stored securely where ellmer can find them. For the LLM, I chose OpenAI’s cheapest current model, GPT-5 nano.
my_chat <- chat_openai(model = "gpt-5-nano")
my_query <- my_dataset$input[1] # That variable holds the text of the first query I want to evaluate, in the dataset I created above
result <- my_chat$chat(my_query)
You can turn that my_chat ellmer chat object into a vitals solver by wrapping it in the generate() function:
# This code won't run yet without the task's third required argument, a scorer
my_task <- Task$new(
dataset = my_dataset,
solver = generate(my_chat) # Using my_chat that was created above
)
The Task object knows to use your dataset’s input column as the question to send to the LLM. If the dataset has multiple queries, generate() handles running them one after another.
Scorer
Finally, you need a scorer. As the name implies, the scorer grades the results. Vitals offers several types of scorers. Two of them use an LLM to evaluate results, an approach sometimes called “LLM as a judge.” One of vitals’ LLM-as-a-judge options, model_graded_qa(), checks how well the solver answered a question. The other, model_graded_fact(), “determines whether a solver includes a given fact in its response,” according to the documentation. Other scorers look for string patterns, such as detect_exact() and detect_includes().
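For a question like the sentiment one, where the target is a single word, a string-matching scorer may be all you need. Here’s a rough sketch of that idea, using detect_includes() with its default arguments; check the vitals documentation for its exact behavior and options:

# Sketch: grade the sentiment question with a string-matching scorer,
# which checks whether the target text ("Mixed") appears in the model's output,
# so no judge LLM is needed
sentiment_task <- Task$new(
  dataset = my_dataset[my_dataset$id == "sentiment-analysis", ],
  solver = generate(my_chat),
  scorer = detect_includes()
)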
Some research suggests LLMs can do a reasonably good job of evaluating results. Still, as with most things involving generative AI, I’m wary of trusting LLM evaluations without a human checking the work.
Pro tip: If you’re testing a smaller, less capable model in your evaluation, you don’t want that same model grading the results. Vitals defaults to using the LLM being tested as the scorer, but you can specify a different LLM as your judge. I generally want a top frontier LLM for my judge unless the scoring is simple.
Here’s what the syntax looks like if we use Claude Sonnet as a model_graded_qa() scorer:
scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))
Note that this scorer defaults to partial_credit = FALSE, meaning an answer is either fully correct or incorrect. You can allow partial credit if that makes sense for your task by adding the argument partial_credit = TRUE:
scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))
I first used Sonnet 4.5 as my scorer, without allowing partial credit. It got one of the gradings wrong, scoring as correct some R code that did most things right for my bar chart but didn’t sort in descending order. I then tried Sonnet 4.6, released just this week, but it also made a grading mistake.
Opus 4.6 is more capable than Sonnet, but it is also roughly 67% more expensive, at $5 per million input tokens and $25 per million output tokens. Your choice of model and provider depends in part on how much testing you do, whether you want a specific LLM that understands your work (Claude has a strong reputation for R code generation), and how important it is that your tasks are scored accurately. Keep a close watch on your usage if cost is a concern. If you prefer not to spend anything while following this tutorial’s examples and are willing to use less capable LLMs, check out GitHub Models, which offers a free tier. ellmer supports GitHub Models with chat_github(), and you can see the available LLMs by running models_github().
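For example (this assumes you’ve already set up a GitHub personal access token; see the chat_github() documentation for details, including which model it defaults to):

models_github()                 # list the LLMs available through GitHub Models
github_chat <- chat_github()    # or pass a specific model name

You could then use github_chat anywhere this article uses another chat object.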
Below, I’ve added model_graded_qa() scoring to my_task and also given the task a name. However, I don’t advise naming your task if you plan to clone it later to try a different model, since cloned tasks keep their original name and, as of this writing, there’s no way to change it.
my_task <- Task$new(
dataset = my_dataset,
solver = generate(my_chat), # Using my_chat that was created above
scorer = model_graded_qa(
scorer_chat = chat_anthropic(model = "claude-opus-4-6")
),
name = "Basic gpt-5-nano test with Opus judging"
)
Now my task is ready to run.
Run your first vitals task
You run a vitals task with the task object’s $eval() method:
my_task$eval()
The eval() method triggers five other methods: $solve(), $score(), $measure(), $log(), and $view(). When it finishes, a built-in log viewer should pop up automatically. Click the hyperlinked task to see more details:
Details on a task run in vitals’ built-in viewer. You can click each sample for additional info.
Sharon Machlis
“C” means correct and “I” means incorrect; there would also have been a “P” for partially correct if I had allowed partial credit.
If you want to open a log file in that viewer later, you can relaunch the viewer with vitals_view("your_log_directory"). The logs are just JSON files, so you can view them in other ways, too.
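For example, since the logs are plain JSON, you could read one directly with a package such as jsonlite (this assumes the files in your log directory use a .json extension):

log_files <- list.files("./logs", pattern = "\\.json$", full.names = TRUE)
one_run <- jsonlite::read_json(log_files[1])
str(one_run, max.level = 1)   # peek at the top-level structure of one run's log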
You’ll probably want to run an evaluation more than once to gain more confidence in an LLM’s reliability and to make sure it didn’t just get lucky. You can set up multiple runs with the epochs argument:
my_task$eval(epochs = 10)
Accuracy for the bar chart code was 70% in one of my 10-epoch runs, which may or may not be “good enough.” Another time it was 90%. To get a real sense of an LLM’s performance when it doesn’t score 100% every time, you need a fairly large sample; the margin of error can be substantial with only a few tests. (For more on statistical analysis of vitals results, see the package’s analysis vignette.)
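Base R can give you a quick feel for that uncertainty. For instance, 7 correct out of 10 epochs comes with a very wide 95% confidence interval:

binom.test(x = 7, n = 10)$conf.int
# roughly 0.35 to 0.93, consistent with anything from "wrong most of the time" to "right nearly always"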
It cost about 14 cents for Sonnet 4.6 to act as a judge, versus 27 cents for Opus 4.6, for 11 total epoch runs of three queries each. (Note that not all of those queries really needed an LLM to evaluate them, if I had been willing to split the demo into multiple task objects. The sentiment analysis just looked for “Mixed,” which is simpler to score.)
The vitals package includes a method for turning a task’s evaluation results into a data frame: my_task$get_samples(). If you want that, save the data frame while the task is still active in your R session:
results_df <- my_task$get_samples()
saveRDS(results_df, "results1.Rds")
You might also consider saving the Task object itself.
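Since tasks are R6 objects, saveRDS() is one simple way to do that, although it’s worth double-checking that a reloaded task behaves as you expect before relying on it:

saveRDS(my_task, "my_task1.Rds")
# later, in a new session:
# my_task <- readRDS("my_task1.Rds")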
If an API glitch occurs while your input queries are running, the whole run fails. If you plan to test with a lot of epochs, it may be worth splitting them into smaller batches so you’re less likely to waste tokens (and time).
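One rough way to do that, sketched here rather than a built-in vitals feature: run a few epochs at a time on fresh clones of the task, so a failure only costs you one small batch, then combine whatever completed with vitals_bind().

# Sketch: 5 batches of 2 epochs instead of one 10-epoch run
batch_tasks <- lapply(1:5, function(i) {
  batch <- my_task$clone()
  batch$eval(epochs = 2)
  batch
})
names(batch_tasks) <- paste0("batch_", 1:5)
all_batches <- do.call(vitals_bind, batch_tasks)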
Swap in a different LLM
There are a few ways to run the same task with a different model. First, create a new chat object with that model. Here’s the code to try Google Gemini 3 Flash Preview:
my_chat_gemini <- chat_google_gemini(model = "gemini-3-flash-preview")
You can then execute the task in one of three ways.
- Clone an existing task and set the chat as its solver with $set_solver():
my_task_gemini <- my_task$clone()
my_task_gemini$set_solver(generate(my_chat_gemini))
my_task_gemini$eval(epochs = 3)
- Clone an existing task and add the new chat as a solver when you run it:
my_task_gemini <- my_task$clone()
my_task_gemini$eval(epochs = 3, solver_chat = my_chat_gemini)
- Create a new task from scratch, which lets you give it a new name:
my_task_gemini <- Task$new(
dataset = my_dataset,
solver = generate(my_chat_gemini),
scorer = model_graded_qa(
partial_credit = FALSE,
scorer_chat = ellmer::chat_anthropic(model = "claude-opus-4-6")
),
name = "Gemini flash 3 preview"
)
my_task_gemini$eval(epochs = 3)
Make sure you’ve set up your API key for each provider you want to test, unless you’re using a platform that doesn’t require one, such as local LLMs with ollama.
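One common way to store those keys is in your .Renviron file, so they’re available as environment variables in every session. The variable names below are the usual defaults for these providers, but check each ellmer chat_*() function’s help page for the exact name it expects:

usethis::edit_r_environ()
# Then add lines such as:
# OPENAI_API_KEY=your-key-here
# ANTHROPIC_API_KEY=your-key-here
# GOOGLE_API_KEY=your-key-here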
Examine multiple task runs
Once you’ve run multiple tasks with different models, you can use the vitals_bind() function to combine the results:
both_tasks <- vitals_bind(
gpt5_nano = my_task,
gemini_3_flash = my_task_gemini
)
Example of combined task results running each LLM with three epochs.
Sharon Machlis
This returns an R data frame with columns for task, id, epoch, score, and metadata. The metadata column holds a data frame in each row, with columns for input, target, result, solver_chat, scorer_chat, scorer_metadata, and scorer.
To flatten the input, target, and result columns so they’re easier to scan and analyze, I unnested the metadata column:
library(tidyr)
both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)
I could then run a quick script to loop through each bar-chart result’s code and see what it produced:
library(dplyr)
# Some results are wrapped in markdown code fences, which must be removed or the R code won't run
extract_code <- function(text) {
  # Remove markdown code fences such as ```r and ```
  text <- gsub("```r\\n|\\n```|```", "", text)
  text <- trimws(text)
  text
}
# Filter for barchart results only
barchart_results <- both_tasks_wide |>
  filter(id == "barchart")
# Loop through each result
for (i in seq_len(nrow(barchart_results))) {
code_to_run <- extract_code(barchart_results$result[i])
score <- as.character(barchart_results$score[i])
task_name <- barchart_results$task[i]
epoch <- barchart_results$epoch[i]
# Display info
cat("\n", strrep("=", 60), "\n")
cat("Task:", task_name, "| Epoch:", epoch, "| Score:", score, "\n")
cat(strrep("=", 60), "\n\n")
# Try to run the code and print the plot
tryCatch(
{
plot_obj <- eval(parse(text = code_to_run))
print(plot_obj)
Sys.sleep(3)
},
error = function(e) {
cat("Error running code:", e$message, "\n")
Sys.sleep(3)
}
)
}
cat("\nFinished displaying all", nrow(barchart_results), "bar charts.\n")
Test local LLMs
This is one of my favorite uses for vitals. Models that fit in my PC’s 12GB of GPU RAM are somewhat limited for now. But I expect smaller models will keep getting good enough for more of the tasks I’d like to run locally with sensitive data. Vitals makes it easy to test new LLMs on my specific use cases.
Vitals (via ellmer) supports ollama, a popular way to run LLMs locally. To use ollama, download and install the ollama application, then run it with the desktop app or from a terminal window. The syntax is ollama pull to download an LLM, or ollama run to download a model and start a chat if you want to confirm the model works on your system. For example: ollama pull ministral-3:14b.
The rollama R package lets you download a local LLM for ollama from within R, as long as ollama is running. The syntax is rollama::pull_model("model-name"), for example rollama::pull_model("ministral-3:14b"). You can check whether R can see ollama running on your system with rollama::ping_ollama().
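Putting those pieces together, here’s how the three local models used below could be pulled from within R (ollama needs to be running first):

rollama::ping_ollama()                    # confirm R can reach ollama
rollama::pull_model("ministral-3:14b")
rollama::pull_model("gemma3:12b")
rollama::pull_model("phi4")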
I also downloaded Google’s gemma3:12b and Microsoft’s phi4, then created tasks for each using the same dataset as before. Note that as of this writing, you need the development version of vitals to handle LLM names that contain colons (the next CRAN version after 0.2.0 is expected to support this):
# Create chat objects
ministral_chat <- chat_ollama(
model = "ministral-3:14b"
)
gemma_chat <- chat_ollama(
model = "gemma3:12b"
)
phi_chat <- chat_ollama(
model = "phi4"
)
# Create one task with ministral, without naming it
ollama_task <- Task$new(
dataset = my_dataset,
solver = generate(ministral_chat),
scorer = model_graded_qa(
scorer_chat = ellmer::chat_anthropic(model = "claude-opus-4-6")
)
)
# Run that task object's evals
ollama_task$eval(epochs = 5)
# Clone that task and run it with different LLM chat objects
gemma_task <- ollama_task$clone()
gemma_task$eval(epochs = 5, solver_chat = gemma_chat)
phi_task <- ollama_task$clone()
phi_task$eval(epochs = 5, solver_chat = phi_chat)
# Turn all these results into a combined data frame
ollama_tasks <- vitals_bind(
ministral = ollama_task,
gemma = gemma_task,
phi = phi_task
)
All three local LLMs got the sentiment analysis right, but all did poorly on the bar chart task. Some of the code generated bar charts, but without the axes flipped and sorted in descending order; other code didn’t run at all.
Results of one run of my dataset with three local LLMs.
Sharon Machlis
R code for the results table above:
library(dplyr)
library(gt)
library(scales)
# Prepare the data
plot_data <- ollama_tasks |>
  rename(LLM = task, task = id) |>
group_by(LLM, task) |>
summarize(
pct_correct = mean(score == "C") * 100,
.groups = "drop"
)
color_fn <- col_numeric(
palette = c("#d73027", "#fc8d59", "#fc8d59", "#fee08b", "#1a9850"),
domain = c(0, 20, 40, 60, 100)
)
plot_data |>
tidyr::pivot_wider(names_from = task, values_from = pct_correct) |>
gt() |>
tab_header(title = "Percent Correct") |>
cols_label(`sentiment-analysis` = html("sentiment-<br>analysis")) |>
data_color(
columns = -LLM,
fn = color_fn
)
It cost me 39 cents for Opus to evaluate these local LLM runs, which seems reasonable.
Extract structured data from text
Vitals has a special function for extracting structured data from plain text: generate_structured(). It requires both a chat object and a defined data type that you expect the LLM to return. As of this writing, you need the development version of vitals to use generate_structured().
First, here’s my new dataset for extracting the topic, speaker name and affiliation, date, and start time from a plain-text description. The more complex version asks the LLM to convert the time from Central European Time to Eastern Time:
extract_dataset <- data.frame(
id = c("entity-extract-basic", "entity-extract-more-complex"),
input = c(
"Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in 'hh:mm' format from the text below. Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages.",
`"Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in Eastern Time zone in 'hh:mm ET' format from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. Convert the given time to Eastern Time if required. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages."`
),
target = c(
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00. OR R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00 CET.",
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 12:00 ET."
)
)
Next, I’ll define a data structure with ellmer’s type_object() function. Each argument gives the name of a data field and its type (string, integer, etc.). I’m saying I want to extract a workshop_topic, speaker_name, current_speaker_affiliation, date (as a string), and start_time (also as a string):
my_object <- type_object(
workshop_topic = type_string(),
speaker_name = type_string(),
current_speaker_affiliation = type_string(),
date = type_string(
"Date in yyyy-mm-dd format"
),
start_time = type_string(
"Start time in hh:mm format, with timezone abbreviation if applicable"
)
)
Then I’ll use the chat objects I created earlier in a new structured-data task, with Sonnet as the judge since the grading is straightforward:
my_task_structured <- Task$new(
dataset = extract_dataset,
solver = generate_structured(
solver_chat = my_chat,
type = my_object
),
scorer = model_graded_qa(
partial_credit = FALSE,
scorer_chat = ellmer::chat_anthropic(model = "claude-sonnet-4-6")
)
)
gemini_task_structured <- my_task_structured$clone()
# You need to add the type to generate_structured(), that's not included when a structured task is cloned
gemini_task_structured$set_solver(
generate_structured(solver_chat = my_chat_gemini, type = my_object)
)
ministral_task_structured <- my_task_structured$clone()
ministral_task_structured$set_solver(
generate_structured(solver_chat = ministral_chat, type = my_object)
)
phi_task_structured <- my_task_structured$clone()
phi_task_structured$set_solver(
generate_structured(solver_chat = phi_chat, type = my_object)
)
gemma_task_structured <- my_task_structured$clone()
gemma_task_structured$set_solver(
generate_structured(
solver_chat = gemma_chat,
type = my_object
)
)
# Run the evaluations!
my_task_structured$eval(epochs = 3)
gemini_task_structured$eval(epochs = 3)
ministral_task_structured$eval(epochs = 3)
gemma_task_structured$eval(epochs = 3)
phi_task_structured$eval(epochs = 3)
# Save results to data frame
structured_tasks <- vitals_bind(
gemini = gemini_task_structured,
gpt_5_nano = my_task_structured,
ministral = ministral_task_structured,
gemma = gemma_task_structured,
phi = phi_task_structured
)
saveRDS(structured_tasks, "structured_tasks.Rds")
It cost me 16 cents for Sonnet to judge 15 evaluation runs, each consisting of two queries and their corresponding results.
Here are the outcomes:
How various LLMs fared on extracting structured data from text.
Sharon Machlis
I was surprised that a local model, Gemma, scored 100%. Wondering whether that was a fluke, I ran the evaluation another 17 times, for a total of 20. Oddly, it made two errors in 20 of the basic extractions by labeling the title “R Package Development” instead of “R Package Development in Positron,” yet scored 100% on the more complex tasks. I asked Claude Opus about this, and it said my “easier” task was more ambiguous for a less capable model to interpret. An important takeaway: Be as precise as possible in your instructions!
Still, Gemma’s results on this task were promising enough that I’d consider testing it on real-world entity extraction. And I wouldn’t have known that without running automated evaluations on several local LLMs.
Conclusion
If you’re used to writing code that returns predictable, consistent responses, a script that gives different answers every time it runs can feel unsettling. While you can never be certain what an LLM will say next, evaluations can boost your confidence in your code by letting you run structured tests with measurable results instead of relying on manual, ad hoc queries. And as the model landscape keeps changing, you can stay current by checking how newer LLMs perform, not on general benchmarks but on the tasks that matter most to you.
Learn more about the vitals R package
Visit the vitals package website.
Use the are dataset on GitHub to run evals on various LLMs and see how well they generate R code.
View Simon Couch’s presentation at posit::conf(2025).