Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages

24.9.2024 21:00:00 CEST | Business Wire | Press Release

Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.

GigaSpeech 2 Overview

GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains about 30,000 hours of automatically transcribed audio covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers about 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The development and test sets for Thai and Indonesian, labeled by Dataocean AI, each contain 10 hours of audio. The team has also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, which achieve performance comparable to commercial speech recognition services.

Dataset Construction

The construction process of GigaSpeech 2 has also been open-sourced. It is an automated pipeline for building large-scale speech recognition datasets from the vast amounts of unlabeled audio available on the internet, and it consists of data crawling, transcription, alignment, and refinement. Whisper first produces a preliminary transcription, TorchAudio then performs forced alignment, and multi-dimensional filtering yields GigaSpeech 2 raw. The dataset is then refined iteratively with an improved Noisy Student Training (NST) method, which raises the quality of the pseudo-labels over repeated iterations and ultimately produces GigaSpeech 2 refined.
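
The iterative refinement step can be pictured as a loop in which each newly trained model relabels and filters the data for the next round. The following is a minimal sketch of a generic Noisy Student Training loop, not the GigaSpeech 2 implementation: the function `noisy_student_training` and its `train`, `transcribe`, and `keep` callables are hypothetical placeholders, and the specifics of the improved NST method are described in the paper and the repository.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
# (audio segment id, transcript); a hypothetical representation for illustration.
Pair = Tuple[str, str]


def noisy_student_training(
    seed_pairs: List[Pair],
    train: Callable[[List[Pair]], Model],
    transcribe: Callable[[Model, str], str],
    keep: Callable[[Model, Pair], bool],
    rounds: int,
) -> Tuple[Model, List[Pair]]:
    """Generic NST loop (illustrative only, all callables are assumptions)."""
    pairs = list(seed_pairs)
    # The first teacher is trained on the initial pseudo-labels.
    teacher = train(pairs)
    for _ in range(rounds):
        # The teacher relabels every remaining segment with a fresh pseudo-label.
        relabeled = [(audio, transcribe(teacher, audio)) for audio, _ in pairs]
        # Pseudo-labels judged unreliable are dropped before the next pass.
        pairs = [pair for pair in relabeled if keep(teacher, pair)]
        # The student is trained on the filtered labels and becomes the next
        # teacher. Noise (for example data augmentation) is assumed to be
        # applied inside `train`.
        teacher = train(pairs)
    return teacher, pairs
```

In a real system, `train` would fit an ASR model, `transcribe` would decode audio with it, and `keep` would apply whatever quality filters the pipeline defines. The sketch fixes only the control flow.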

GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.

Training Set Details

GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:

- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
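
The share of raw audio that survives refinement follows directly from the hours listed above. The short sketch below computes it per language and overall. The variable names are illustrative, and the only inputs are the figures quoted in this release.

```python
# Training-set hours quoted above: language -> (raw hours, refined hours).
HOURS = {
    "Thai": (12901.8, 10262.0),
    "Indonesian": (8112.9, 5714.0),
    "Vietnamese": (7324.0, 6039.0),
}

# Fraction of raw hours retained in the refined version, per language.
retention_by_language = {
    language: refined / raw for language, (raw, refined) in HOURS.items()
}

# Totals across the three languages.
total_raw = sum(raw for raw, _ in HOURS.values())
total_refined = sum(refined for _, refined in HOURS.values())

for language, share in retention_by_language.items():
    print(f"{language}: {share:.1%} of raw hours retained")
print(f"Overall: {total_refined / total_raw:.1%} of {total_raw:.1f} raw hours retained")
```

With these figures, the refined sets keep between about 70% and 83% of the raw hours per language.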

Development and Test Set Details

Ke Li, Dataocean AI’s COO and one of the paper's authors, led the GigaSpeech 2 test set project. Drawing on nearly 20 years of project experience, the team labeled the Thai and Indonesian sets with a word accuracy of over 97%. Beyond these two Southeast Asian languages, Dataocean AI’s team covers more than 200 languages and dialects worldwide. The company also offers more than 1,600 high-quality off-the-shelf datasets for scenarios such as generative AI, autonomous driving, smart home, and customer service, meeting the evolving needs of the AI industry.

Experimental Results

We compared speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison covered Thai, Indonesian, and Vietnamese. Performance was evaluated on three test sets (GigaSpeech 2, Common Voice 17.0, and FLEURS), using Character Error Rate (CER) or Word Error Rate (WER) as the metric. The results indicate:

Thai: Our model surpassed all competitors, including the commercial interfaces from Microsoft and Google, while having only one-tenth as many parameters as Whisper large-v3.

Indonesian and Vietnamese: Our system performed competitively with the existing baseline models in both languages.
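
For readers unfamiliar with the two metrics used in the evaluation, the sketch below shows the standard definitions: the edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. It is a minimal illustration rather than the evaluation code behind the results above. The function names are our own, and text normalization (casing, punctuation), which real evaluations typically apply first, is omitted. CER is a common choice for languages written without spaces between words, such as Thai.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over individual characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` counts one substitution against three reference words, and a word accuracy of over 97% corresponds to a WER below 3%.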

Resource Links

The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2

The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2

The preprint paper is available at:
https://arxiv.org/pdf/2406.11546

Dataocean AI website:
https://www.dataoceanai.com

View source version on businesswire.com: https://www.businesswire.com/news/home/20240924609911/en/

Contacts

contact@dataoceanai.com
